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Abstract 

The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral 
recombination graph" (ARG), a complete record of coalescence and recombination events in the history of the sample. 
However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small 
numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, 
we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian 
genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n — \ 
chromosomes, an operation we call "threading." Using techniques based on hidden IVlarkov models, we can perform this 
threading operation exactly, up to the assumptions of the sequentially IVlarkov coalescent and a discretization of time. An 
extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading 
operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in 
a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the 
posterior distribution over ARGs and is effective in recovering various features of the ARG for dozens of sequences 
generated under realistic parameters for human populations. In applications ARGweaver X.0 54 human genome sequences 
from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry 
associated with balancing selection and reductions in allele age in sites under directional selection. The patterns we observe 
near protein-coding genes are consistent with a primary influence from background selection rather than hitchhiking, 
although we cannot rule out a contribution from recurrent selective sweeps. 
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Introduction 

At each genomic position, orthologous DNA sequences drawn 
from one or more populations are related by a branching structure 
known as a genealogy [1,2]. Historical recombination events lead 
to changes in these genealogies from one genomic position to the 
next, resulting in a correlation structure that is complex, 
analytically intractable, and poorly approximated by standard 
representations of high-dimensional data. Over a period of many 
decades, these unique features of genetic data have inspired 
numerous innovative techniques for probabilistic modeling and 
statistical inference [3-9], and, more recendy, they have led to a 
variety of creative approaches that achieve computational 
tractabUity by operating on various summaries of the data [10- 
17]. Nevertheless, none of these approaches fuUy captures the 
correlation structure of collections of DNA sequences, which 
inevitably leads to limitations in power, accuracy, and generality in 
genetic analysis. 

In principle, the correlation structure of a collection of colinear 
orthologous sequences can be fully described by a network known 



as an ancestral recombination graph (ARG) [18-20]. An ARG provides 
a record of all coalescence and recombination events since the 
divergence of the sequences under study and specifies a complete 
genealogy at each genomic position (Figure lA). In many senses, 
the ARG is the ideal data structure for population genomic 
analysis. Indeed, if an accurate ARG could be obtained, many 
problems of interest today — such as the estimation of recombina- 
tion rates or ancestral effective population sizes — ^would become 
trivial, while many other problems — such as the estimation of 
population divergence times, rates of gene flow between popula- 
tions, or the detection of selective sweeps — would be greatiy 
simplified. Various data representations in wide use today, 
including the site frequency spectrum, principle components, 
haplotype maps, and identity by descent spectra, can be thought of 
as low-dimensional summaries of the ARG and are strictly less 
informative. 

An extension of the widely used coalescent framework [1,2,9] 
that includes recombination [21] is regarded as an adequately rich 
generative process for ARGs in most settings of interest. While 
simulating an ARG under this model is fairly straightforward. 
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Author Summary 

The unusual and complex correlation structure of popu- 
lation samples of genetic sequences presents a funda- 
mental statistical challenge that pervades nearly all areas 
of population genetics. Historical recombination events 
produce an intricate network of intertwined genealogies, 
which impedes demography Inference, the detection of 
natural selection, association mapping, and other applica- 
tions. It is possible to capture these complex relationships 
using a representation called the ancestral recombination 
graph (ARG), which provides a complete description of 
coalescence and recombination events In the history of 
the sample. However, previous methods for ARG inference 
have not been adequately fast and accurate for practical 
use with large-scale genomic sequence data. In this article, 
we introduce a new algorithm for ARG inference that has 
vastly improved scaling properties. Our algorithm is 
implemented in a computer program called ARGweaver, 
which is fast enough to be applied to sequences 
megabases in length. With the aid of a large computer 
cluster, ARGweaver can be used to sample full ARGs for 
entire mammalian genome sequences. We show that 
ARGweaver performs well in simulation experiments and 
demonstrate that It can be used to provide new insights 
about both demographic processes and natural selection 
when applied to real human genome sequence data. 

however, using it to reconstruct an ARG from sequence data is 
notoriously difScult. Furthermore, tlie data are generally only 
weakly informative about the ARG, so it is often desirable to 
regard it as a "nuisance" variable to be integrated out during 
statistical inference (e.g., [22]). During the past two decades, 
various attempts have been made to perform explicit inference of 
ARGs using techniques such as importance sampling [19,22] (see 
also [23]) and Markov chain Monte Carlo sampling [24-27]. 
There is also a considerable literature on heuristic or approximate 
methods for ARG reconstruction in a parsimony framework [28- 
35]. Several of these approaches have shown promise, but they are 
generally highly computationally intensive and/or limited in 
accuracy, and they are not suitable for apphcation to large-scale 
data sets. As a result, explicit ARG inference is rarely used in 
applied population genomics. 

The coalescent-with-recombination is conventionally described 
as a stochastic process in time [21], but Wiuf and Hein [36] 
showed that it could be reformulated as a mathematically 
equivalent process along the genome sequence. Unlike the process 
in time, this "sequential" process is not Markovian because long- 
range dependencies are induced by so-called "trapped" sequences 
(genetic material nonancestral to the sample flanked by ancestral 
segments). As a result, the full sequential process is complex and 
computationally expensive to manipulate. Interestingly, however, 
simulation processes that simply disregard the non-Markovian 
features of the sequential process produce collections of sequences 
that are remarkably consistent in most respects with those 
generated by the full coalescent-with-recombination [37,38]. In 
other words, the coalescent-with-recombination is almost Mar- 
kovian, in the sense that the long-range correlations induced by 
trapped material are fairly weak and have a minimal impact on the 
data. The original Markovian approximation to the full process 
[37] is known as the sequentially Markov coalescent (SMC), and an 
extension that allows for an additional class of recombinations [38] 
is known as the SMC. 

In recent years, the SMC has become favorite starting point for 
approximate methods for ARG inference [39-42] . The key insight 
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Figure 1. An ancestral recombination graph (ARG) for four 
sequences. (A) Going backwards in time (from bottom to top), the 
graph shows how lineages that lead to modern-day chromosomes 
(bottom) either "coalesce" into common ancestral lineages (dark blue 
circles), or split into the distinct parental chromosomes that were joined 
(in forward time) by recombination events (light blue circles). Each 
coalescence and recombination event is associated with a specific time 
(dashed lines), and each recombination event is also associated with a 
specific breakpoint along the chromosomes (here, 62 and 63). Each non- 
recombining interval of the sequences (shown in red, green, and 
purple) corresponds to a "local tree" embedded in the ARG (shown in 
matching colors). Recombinations cause these trees to change along 
the length of the sequences, making the correlation structure of the 
data set highly complex. The ARG for four sequences is denoted in 
our notation. (B) Representation of in terms of a sequence of local 
trees T'* and recombination events R*. A local tree Tf is shown for each 
nonrecombining segment in colors matching those in (A). Each tree, Tf, 
can be viewed as being constructed from the previous tree, Tf_i, by 
placing a recombination event along the branches of r^'_, (light blue 
circles), breaking the branch at this location, and then allowing the 
broken lineage to re-coalesce to the rest of the tree (dashed lines in 
matching colors; new coalescence points are shown in gray). Together, 
the local trees and recombinations provide a complete description of 
the ARG. The Sequentially Markov Coalescent (SMC) approximate the 
full coalescent-with-recombination by assuming that T" is statistically 
independent of all previous trees given T" j. (C) An alignment of four 
sequences, D*, corresponding to the linearized ARG shown in (B). For 
simplicity, only the derived alleles at polymorphic sites are shown. The 
sequences are assumed to be generated by a process that samples an 
ancestral sequences from a suitable background distribution, then 
allows each nonrecombining segment of this sequence to mutate 
stochastically along the branches of the corresponding local tree. 
Notice that the correlation structure of the sequences is fully 
determined by the local trees; that is, D" is conditionally independent 
of the recombinations R" given the local trees T". 
doi:1 0.1 371/journal.pgen.l 004342.g001 
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behind these methods is that, if the continuous state space for the 
Markov chain (consisting of all possible genealogies) is approxi- 
mated by a moderately sized finite set — typically by enumerating 
tree topologies and/ or discretizing time — then inference can be 
performed efficiently using well-known algorithms for hidden 
Markov models (HMMs). Perhaps the simplest and most elegant 
example of this approach is the pairwise sequentially Markov 
coalescent (PSMC) [42], which applies to pairs of homologous 
chromosomes (typically the two chromosomes in a diploid 
individual) and is used to reconstruct a profile of effective 
population sizes over time. In this case, there is only one possible 
tree topology and one coalescence event to consider at each 
genomic position, so it is sulficient to discretize time and allow for 
coalescence within any of A: possible time slices. Using the resulting 
k-state HMM, it is possible to perform inference integrating over 
all possible ARGs. A similar HMM-based approach has been used 
to estimate ancestral effective population sizes and divergence 
times from individual representati\'es of a few closely related 
species [39-41]. Because of their dependency on a complete 
characterization of the SMC state space, however, these methods 
can only be applied to small numbers of samples. This limits their 
utility with newly emerging population genomic datasets and leads 
to reduced power for certain features of interest, such as recent 
effective population sizes, recombination rates, or local signatures 
of natural selection. 

An alternative modeling approach, with better scaling proper- 
ties, is the product of approximate conditionals (PAC) or 
"copying" model of Li and Stephens [43]. The PAC model is 
motivated primarily by computational tractabUity and is not based 
on an explicit evolutionary model. The model generates the «th 
sequence in a collection by concatenating (noisy) copies of 
fragments of the previous n — l sequences. The source of each 
copied fragment represents the "closest" (most recentiy diverged) 
genome for that segment, and the noise process allows for 
mutations since the source and destination copies diverged. The 
PAC framework has been widely used in many apphcations in 
statistical genetics, including recombination rate estimation, local 
ancestry inference, haplotype phasing, and genotype imputation 
(e.g., [44-48]), and it generally olfers good performance at 
minimal computational cost. Recently, Song and colleagues have 
generalized this framework to make use of conditional sampling 
distributions (CSDs) based on models closely related to, and in 
some cases equivalent to, the SMC [49-52]. They have 
demonstrated improved accuracy in conditional likelihood calcu- 
lations [49,50] and have shown that their methods can be effective 
in demographic inference [51,52]. However, their approach 
avoids explicit ARG inference and therefore can only be used to 
characterize properties of the ARG that are directiy determined by 
model parameters (see Discussion). 

In this paper, we introduce a new algorithm for ARG inference 
that combines many of the benefits of the small-sample SMC- 
based approaches and the large-sample CSD-based methods. Like 
the PSMC, our algorithm requires no approximations beyond 
those of the SMC and a discretization of time, but it improves on 
the PSMC by allowing multiple genome sequences to be 
considered simultaneously. The key idea of our approach is to 
sample an ARG of n sequences conditional on an ARG of w — 1 
sequences, an operation we call "threading." Using HMM-based 
methods, we can efficientiy sample new threadings from the exact 
conditional distribution of interest. By repeatedly removing and re- 
threading individual sequences, we obtain an efficient Gibbs 
sampler for ARGs. This basic Gibbs sampler can be improved by 
including operations that rethread entire subtrees rather than 
individual sequences. Our implementation of these methods. 



called ARGweaver, is efficient enough to sample fiall ARGs on a 
genome-wide scale for dozens of diploid individuals. Simulation 
experiments indicate that ARGweaver converges rapidly and is able 
to recover many properties of the true ARG with good accuracy. 
In addition, our explicit characterization of the ARG enables us to 
examine many features not directly described by model param- 
eters, such as local times to most recent common ancestry, allele 
ages, and gene tree topologies. These quantities, in turn, shed light 
on both demographic processes and the influence of natural 
selection across the genome. For example, we demonstrate, by 
applying ARGweaver to 54 individual human sequences from 
Complete Genomics, that it provides insight into the sources of 
reduced nucleotide diversity near functional elements, the 
contribution of balancing selection to regions containing very 
old polymorphisms, and the relative influences of direct and 
indirect selection on allele age. Our ARGweaver soitwdrc (https:// 
github.com/mdrasmus/argweaver), our sampled ARGs (http:/ / 
compgen.bscb.comell.edu/ARGweaver/ CG_results), and genome- 
browser tracks summarizing these ARGs (http:// genome-mirror.bscb. 
comell.edu; assembly hgl9) are all freely available. 

Results 

The Sequentially Markov Coalescent 

The starting point for our model is the Sequentially Markov 
Coalescent (SMC) introduced by McVean and Cardin [37]. We 
begin by briefly reviewing the SMC and introducing notation that 
will be useful below in describing a general discretized version of 
this model. 

The SMC is a stochastic process for generating a sequence of 
local trees, T" = T",...,T^ and corresponding genomic break- 
points b = b\, . . . , bm+i, such that each T"{\<i<m) describes 
the ancestry of a collection of n sequences in a nonretximbining 
genomic interval \bi,bi+{), and each breakpoint hi between 
intervals 7?_j and T" corresponds to a recombination event 
(Figure IB). The model is continuous in both space and time, with 
each node v in each Tf having a real-valued age ?(v)>0 in 
generations ago, and each breakpoint bj falling in the continuous 
interval [0, L\, where L is the total length of the genomic segment 
of interest in nucleotide sites. The inter\'als are exhaustive and 
nonoverlapping, with b\=Q, b„+i=L, and bi<bi+i for all 
Each T" is a binary tree with t{v) = 0 for all leaf nodes v. We wiU 
use the convention of indexing branches in the trees by their 
descendant nodes; that is, branch v is the branch between node v 
and its parent. 

As shown by Wiuf and Hein [36], the correlation stmcture of 
the local trees and recombinations under the full coalescent-with- 
recombination is complex. The SMC approximates this distribu- 
tion by assuming that T" is conditionally independent of 
T",. . . ,T"_2 given T"_^, and, similarly, that bf depends only on 
bi-i and 7i_i, so that. 



p(r , b\N, p) 

=p(rf|7V) 

P(,b„,+,=L\b^,T"J 



U P(b.\b._uTli)P(T^\Tl^,N, p) 

1=1 



where N is the effective population size, p is the recombination 
rate, and it is understood that ^>i=0. Thus, the SMC can be 
viewed as generating a sequence of local trees and corresponding 
breakpoints by a first-order Markov process. The key to the model 
is to define the conditional distributions P(Z),|6,_i,T'"_j) and 
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P{T"\T"_i, N, p) such that this Markov process closely approx- 
imates the coalescent-with-recombination. Briefly, this is accom- 
plished by first sampling the initial tree T" from the standard 
coalcsccnt and setting hi =0, and then iteratively (i) dc't(;rmining 
the next breakpoint, hi, by incrementing bj^i by an exponential 
random variate with rate /9|7?_]|, where 17?! denotes the total 
branch length of T"; [n] sampling a recombination point 
^i = ('V;,M,) uniformly along the branches beneath the root of 
T"_ J , where w, is a branch and w, is a time along that branch; (iii) 
dissolving the branch vf,- above point and (iv) allowing w, to 
rejoin the remainder of tree T"_ j above time M; by the standard 
coalescent process, creating a new tree T" (Figure IB). As a 
generative process for an arbitrary number of genomic segments, 
the SMC can be implemented by simply repeating the iterative 
process until hi>L, then setting m equal to i—\ and im+l equal 
to L. 

Notice that, if the sampled recombination points Ri are 
retained, this process generates not only a sequence of local trees 
but a complete ARG. In addition, a sampled sequence of local 
trees, T", is sufficient for generation of « aligned DNA sequences 
corresponding to the leaves of the trees (Figure IC). Augmented in 
this way, the SMC can be considered a fuU generative model for 
ARGs and sequence data. 

The Discretized Sequentially Markov Coalescent 

We now define an approximation of the SMC that is discrete in 
both space and time, which we call the Discretized Sequentially 
Markov Coalescent (DSMC). The DSMC can be viewed as a 
generalization to multiple genomes of the discretized pairwise 
sequentially Markov coalescent (PSMC) used by Li and Durbin 
[42]. It is also closely related to several other recendy described 
discretized Markovian coalescent models [39,40,50]. 

The DSMC assumes that time is partitioned into K intervals, 
whose boundaries are given by a sequence of time points 
V = {S(i,...,sk), with *o = 0, Sj+\>Sj for all j {0<j<K), and Sk 
equal to a user-specified maximum value. (See Table 1 for a key to 
the notation used in this paper.) Every coalescence or recombi- 
nation event is assumed to occur precisely at one of these ^-1-1 
time points. Various strategies can be used to determine these time 
points (see, e.g., [50]). In this paper, we simply distribute them 
uniformly on a logarithmic scale, so that the resolution of the 
discretization scheme is finest near the leaves of the ARG, where 
the density of events is expected to be greatest (see Methods). Each 
local block is assumed to have an integral length measured in base 
pairs, with all recombinations occurring between adjacent 
nucleotides. The DSMC approaches the SMC as the number of 
intervals K and the sequence length L grow large, for fixed and 
P- 

Like the SMC, the DSMC generates an ARG G" for n (haploid) 
sequences, each containing L nucleotides (Figure IB). In the 
discrete setting, it is convenient to define local trees and 
recombination events at the level of individual nucleotide 
positions. Assuming that R" denotes a recombination between 
r«^i and T^, we write G" = {T",R"), widi J" =(rf ,...,r«) for 
positions \ ,. . .,L, and K' = {R^,...,R!£^. Notice that it is possible in 
this setting that R" = 0 and T" = T"_ j . Where a recombination 
occurs {R" # 0), we write R" = (h',-,m,) where w,- is the branch in 
r"_j and MieP is the time- point of the recombination. For 
simphcity and computational efficiency, we assume that at most 
one recombination occurs between each pair of adjacent sites. 
Given the sparsity of variant sites in most data sets, this 
simplification is likely to have, at most, a minor effect during 
inference (see Discussion). 



Like the SMC, the DSMC can additionally be used to generate 
an alignment of DNA sequences (Figure IC). We denote such an 
alignment by D" = (D" , . . . , D^), where each D" represents an 
alignment column of height n. Each D" can be generated, in the 
ordinary way, by sampling an ancestral allele from an appropri- 
ate background distribution, and then allowing this allele to 
mutate stochastically along the branches of the corresponding 
local tree, in a branch-length-dependent manner. We denote the 
induced conditional probability distribution over alignment 
columns by P(D'l\T" where )i is the mutation rate. In this 
work, we assume a Jukes-Cantor model [53] for nucleotide 
mutations along the branches of the tree, but another mutation 
model can easily be used instead. Notice that, while the 
recombinations R" are required to define the ARG completely, 
the probability of the sequence data given the ARG depends only 
on the local trees T". 

The Threading Problem 

In the case of an observed alignment. If, and an unobserved 
ARG, G" = (T",R"), file DSMC can be viewed as a hidden 
Markov model (HMM) with a state- sjiace giv(;n by all possible 
local trees, transition probabilities given by expressions of the form 
P{R"i\T^_^, p) P{T^\Rl,T1_i,N), and emission probabUities given 
by the conditional distributions for alignment columns, 
P(D^\T",^). The complete data likelihood function of this 
model— that is, the joint probability of an ARG G" = (T", R") 
and a sequence alignment /)" given model parameters 
0 = (/i, p, AO — can be expressed as a product of these terms over 
alignment positions (see Methods for further details): 

P(T", BT, D"|©) 

L 

= p(T"i\N)P(Di\Ti li) n p(j?«|rf_i, p) p{ti\r;i, r«_i, 7V)(2) 
pmri, IX). 

This HMM formulation is impractical as a framework for direct 
inference, however, because the set of possible local trees — and 
hence the state space — grows super-exponentially with n. Even 
with additional assumptions, similar approaches have only 
been able to accommodate small numbers of sequences 
[32,35,54]. 

Instead, we use an alternative strategy with better scaling 
properties. The key idea of our approach is to sample the ancestry 
of only one sequence at a time, while conditioning on the ancestry 
of the other w — 1 sequences. Repeated applications of this 
"threading" operation form the basis of a Markov chain Monte 
Carlo sampler that explores the posterior distribution of ARGs. In 
essence, the threading operation adds one branch to each local 
tree in a manner that is consistent with the assumed recombination 
process and the observed data (Figure 2). While conditioning on a 
given set of local trees introduces a number of technical challenges, 
the Markovian properties of the DSMC are retained in the 
threading problem, and it can be solved using standard dynamic 
programming algorithms for HMMs. 

The threading problem can be precisely described as follows. 
Assume we are given an ARG for n—\ sequences, a 
corresponding data set D"^^, and a set of model parameters 
& = {fl, p, N). Assume further that G"^' is consistent with the 
assumptions of the DSMC (for example, all of its recombination 
and coalescent events occur at time points in V and it contains at 
most one recombination per position). Finally, assume that we are 
given an «th sequence d, of the same length of the others, and let 
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Table 1. Key to notation. 



Population Genetic Parameters 





Mutation rate, in events per site per generation 


p 


Recombination rate, in events per site per generation 


N 


Effective population size, in number of individuals^ 



0 Full parameter set, © = (/(, p, A'^) 



Time Discretization 


K 


Total number of time intervals (user-defined) 




Time point j {0<j<K), defining a boundary between time intervals (generations before present) 


Asj 


Length of yth time interval, \sj = si+i—Sj 


^J+\ 


Midpoint of yth time interval 


B(TJ) 


Set of branches in a tree T associated with time interval J 


B, 


Number of branches associated with time interval j, Bj = \B{TJ)\ (with T determined by context) 


A(T,j) 


Set of "active" branches at time point J 


Ai 


Number of "active" branches at time point j, Aj = \A{T,J)\ (with T determined by context) 


Ancestral Recombination Graph 


L 


Length of analyzed sequence alignment in nucleotides 


n 


Number of sequences in alignment 




Alignment column at /th position; cumulatively, D" = (D" ,iy[) 




Local tree for /th position; cumulatively, T" =(T", . . . ,T'l) 


Recombination point between / — 1st and ixh position; cumulatively, R" = {R2, . . . , R"i) 


G" 


Full ARG for n sequences, G" = {T'\ R") 


JV ={»/,'/) 


Coalescence point for threaded sequence at /th position, defined by a branch w, and a time point r, 
cumulatively, Y = {y\, . . . , yt) 


;,- = ("',-. Ui) 


Recombination point for threaded sequence between positions /— 1 and /, defined by a branch if, 
and a time point w,; cumulatively, Z = (z2, . . . , r^) 


Hidden Markov Model 




Transition probability from state / to state m between position / and /+1 




Initial state probability for state / 


b)(ir.) 


Emission probability for alignment column D" in state / at position / 



^IVlodei allows for a separate A^, for each time interval I but all analyses in this paper assume a constant across time intervals. 
doi:1 0.1 371 /journal.pgen.l 004342.t001 



D" = (iy ',(/). The threading problem is to sample a new ARG 
G" from the conditional distribution P(G"\G"-\D" ,&) under the 
DSMC. 

The problem is simplified by recognizing that G" can be defined 
by augmenting G"^' with the additional recombination and 
coalescence events required for the «th sequence. First, let G"^' 
be represented in terms of its local trees and recombination points: 
G"^' =(T"^', /f"^'). Now, observe that specifying the new 
coalescence events in G"^' is equivalent to adding one branch to 
each local tree, T"^^ for fe{l, . . . ,L}, to obtain a new tree T" 
(Figure 2). Let us denote the point at which each of these new 
branches attaches to the smaller subtree at each genomic position / 
by yi = (Xi,ti), where x,- indicates a branch in 7""^' and tjeV 
indicates the coalescence time along that branch. Thus, the 
coakscence threading of the nth sequence is given by the sequence 

To complete the definition of G", we must also specify the 
precise locations of the additional recombinations associated with 
the threading — that is, the specific time point at which each 



branch in a local tree T)-! was broken before the branch was 
allowed to re-coalesce in a new location in tree J). Here it is 
useful to partition the recombinations into those that are given by 
G"^', denoted /f"^', and those new to G", which we denote 
Z = (z\, . . . ,zl) (Figure 3A&B). Each z,- is either null (z, = 0), 
meaning that there is no new recombination between T" ^ and 
T", or defined by = where w, is a branch in T" ^ and 

UieV is the time along that branch at which the recombination 
occurred. We call Z the recombination threading of the nth sequence. 
For reasons of efficiency, we take a two-step approach to 
threading: first, we sample the coalescence threading Y, and 
second, we sample the recombination threading Z conditional on 
y. This separation into two steps allows for a substantially 
reduced state space during the coalescence threading operation, 
leading to significant savings in computation. When sampling the 
coalescence threading (step one), we integrate over the locations 
of the new recombinations Z, as in previous work [42,50]. 
Sampling the recombination threading (step two) can be 
accomplished in a straightforward manner independently for 
each recombination event, by taking advantage of the conditional 
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2 3 4 




Figure 2. The "threading" operation. The threading operation adds an «th sequence to an ARG of « — 1 sequences under a discretized version of 
the SMC (the DSIVIC) that requires all coalescence and recombination events to occur precisely at pre-defined time points, so,...sk (horizontal 
dashed lines). In this example, the fourth sequence has been removed from ARG from Figure 1, leaving a tree with « — 1 = 3 leaves at each position 
i (T"^'; shown in black). The fourth sequence (shown in red) is re-threaded through the remaining portion of the ARG by a two-step process that first 
samples a coalescence point y, for this sequence at each T"^' (dark blue points), thereby defining a new tree Tf, and second, samples a 
recombination point r, to reconcile each adjacent pair of trees, (T" ^,Tf ) (light blue points). For simplicity, only the distinct local trees for the four 
nonrecombining segments (after threading) are shown. The gray box highlights the pair of trees immediately flanking the breakpoint 63. Notice that 
the first recombination from Figure 1 is retained (dark gray nodes and dashed line in left-most tree). In general, new recombinations are prohibited at 
the locations of "given" recombinations R!'^^ (see text). Note that it is possible for the attachment point of the nth sequence in the local trees to 
move due to old recombinations as well as new ones (not shown in this example). 
doi:1 0.1 371/journal.pgen.1 004342.g002 



independence structure of the DSMC model (see Methods for 
details). 

The core problem, then, is to accomplish step one by sampling 
the coalescence threading Y from the distribution. 



P{Y\T"-\ R"-\ir,&)ccP(YJ"-\ R" \ Df'\@) 
= P{T'l-\yi\N)P(b\\Tl-\yu^i) 



(3) 



L 

n 

i=2 



n P(R'-\ f1-\y\f1zl,yi_,, p, N) P(D".\T:' 



where the notation A indicates that random variable A is held 
fixed ("clamped") at a particular value throughout the procedure. 
This equation defines a hidden Markov model with a state space 
given by the possible values of each yi, transition probabilities 
given by fl}„, =P(^r l,ff-l, =m!f;T/,>',-i =/,p,iV) and 
emission probabilities given by h'^j(D") = P(D"\T"^^ , yi = m, ^) 
(Figure 3C). Notice that the location of each new recombination, 
z,-, is imphcitly integrated out in the defmition of Despite 
some unusual features of this model — for example, it has a 
heterogeneous state space and normalization structure along the 
sequence — its Markovian dependency structure is retained, and 
the problem of drawing a coalescent threading Y from the desired 
conditional distribution can be solved exactly by dynamic 
programming using the stochastic traceback algorithm for HMMs. 
Additional optimizations allow this step to be completed in time 
linear in both the number of sequences n and the alignment length 
L and quadratic only in the number of time intervals K (see 
Methods for details). 

Markov Chain Monte Carlo Sampling 

The main value of the threading operation is in its usefulness as 
a building block for Markov chain Monte Carlo methods for 
sampling from an approximate posterior distribution over ARGs 
given the data. We employ three main types of sampling 
algorithms based on threading, as described below. 



Sequential sampling. First, the threading operation can be 
applied iteratively to a series of orthologous sequences to obtain an 
ARG of size n from sequence data alone. This method works by 
randomly choosing one sequence and constructing for it a trivial 
ARG (i.e. every local tree is a single branch). Additional 
sequences are then threaded into the ARG, one at a time, until an 
ARG G" of n sequences has been obtained. Notice that an ARG 
derived in this manner is not a valid sample from the posterior 



distribution, because each successive G (for ke{2, . 
sampled conditional on only Z)'*^ (the first k 
Nevertheless, the sequential sampling algorithm is 
heuristic method for obtaining an initial ARG, 
subsequently be improved by other methods. If 



■1}) is 



sequences), 
an efficient 
which can 
desired, this 



operation can be applied multiple times, possibly with various 
permutations of the sequences, to obtain multiple initializations of 
an MCMC sampler. Heuristic methods can also be used to choose 
a "smart" initial ordering of sequences. For example, one might 
begin with one representative of each of several populations, to 
first approximate the overall ARG structure, and subsequently add 
more representatives of each population. 

Gibbs sampling for single sequences. Second, the thread- 
ing operation can serve as the basis of a Gibbs sampler for fuU 
ARGs. Starting with an initial ARG of n sequences, individual 
sequences can be removed, randomly or in round-robin fashion, 
and rethreaded. Since the threading procedure samples from the 
conditional distribution P(G"\G"^^ , Z)",0), this produces a valid 
Gibbs sampler for the ARG up to the assumptions of the DSMC. 
The ergodicity of the Markov chain follows, essentially, from the 
fact that any tree is reachable from any other by a finite sequence 
of branch removals and additions (see Text SI for details). 

The main limitation of this method is that it leads to poor 
mixing when the number of sequences grows large. The essential 
problem is that rethreading a single sequence is equivalent to 
resampling the placement of external branches in the local trees, so 
this method is highly inefficient at rearranging the "deep 
structure" (internal branches) of the ARG. Furthermore, this 
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Figure 3. Graphical models for Discretized Sequentially 
Markov Coalescent (DSMC) models. (A) Full DSMC model for n 
samples with local trees, T" = {T" , . . .T'[), recombinations, 
R" = (R", . . . R'[), and alignment columns, D" = {D", . . . D"^^). Together, 
T" and R" define an ancestral recombination graph, G". Solid circles 
indicate observed variables and empty circles indicate latent variables. 
Arrows indicate direct dependencies between variables and correspond 
to conditional probability distributions described in the text. Notice that 
the R" variables can be integrated out of this model, leading to the 
conventional graph topology for a hidden Markov model. (B) The same 
model as in (A), but now partitioning the latent variables into 
components that describe the history of the first « — 1 sequences 
(J"^' and R"^') and components specific to the «th sequence 
(y = (j'i, . . . jV'i) and Z = (ri, . . . ,zl)). The J"^' and R" ' variables are 
represented by solid circles because they are now "clamped" at specific 
values. A sample of (Y,Z) represents a threading of the Kth sequence 
through the ARG. (C) Reduced model after elimination of Z by 
integration, enabling efficient sampling of coalescent threadings Y. 
This is the model used by the first step in our two-step sampling 
approach. In the second step, the Z variables are sampled conditional 
on y, separately for each r,. In this model, the grouped nodes have 
complex joint dependencies, leading to a heterogeneous state space 
and normalization structure, but the linear conditional independence 
structure of an HMM is retained. 
doi:1 0.1 371 /journal.pgen.1 004342.g003 

mixing problem becomes progressively worse as n grows larger. 
Indeed, as n approaches infinity, the single-sequence threading 
operation reduces to a procedure that selects a sequences of short 
genealogy "tips" leading to other sequences in the data set, leaving 
all other aspects of the ARG unchanged; in effect, it approaches 
the "copying" model of Li and Stephens [43]. As a result, an 
alternative strategy for ARG sampling is needed for large numbers 
of sequences. 

Subtree sampling. The third sampling strategy addresses the 
mixing limitations of the single-sequence Gibbs sampler by 



generalizing the threading operation to accommodate not only 
individual sequences but subtrees with arbitrary numbers of leaves. 
As a result, internal branches in the local trees can be resampled 
and the full ARG can be perturbed, including the deep branches 
near the roots of the local trees. 

In principle, one could address the subtree threading problem 
by arbitrarily selecting an internal branch for each nonrecombin- 
ing segment of the ARG and resampling its attachment point to 
the remainder of the tree, by essentially the same procedure used 
for the reattachment of external branches in single-sequence 
threading. The problem is that, because the local trees change 
along the sequence, it is impossible in general to select a sequence 
of internal branches whose subtrees are maintained across the 
entire ARG (this is possible only for external branches). 
Furthermore, if a poor sequence of internal branches is selected, 
the attachment points at both ends of each segment will be 
constrained by the flanking local trees, creating a strong tendency 
to resample the original attachment points, which would result in 
poor mixing of the sampler. 

To address this problem, we devised a novel method for 
selecting sequences of subtrees guaranteed to have good continuity 
properties. Once such a sequence is selected, the subtree threading 
operation can be accomplished efiiciendy using the stochastic 
traceback algorithm, in a similar manner as with single sequences. 
Our algorithm for selecting sequences of internal branches is fairly 
technical in nature and a detailed description is left for Text SI. 
Briefly, to select sequences of subtrees, we use a data structure 
called a branch graph, which traces the shared ancestry among 
branches across genomic positions. Using dynamic programming, 
we are able to identify paths through the branch graph that 
correspond to sequences of internal branches with good continuity 
properties. After a sequence of internal branches is identified, tiie 
selected branch is removed from each local tree, splitting it into a 
main tree and a subtree. A new branch is then added above the 
root of every subtree and allowed to re-coalesce with the 
corresponding main tree in a manner consistent with the DSMC. 

One important limitation of the algorithm is worth noting. As in 
the single-sequence case, the stochastic traceback algorithm 
samples from the desired conditional distribution over subtree 
threadings. However, since the number of ways of removing 
internal branches depends on the current structure of the ARG, 
the Hastings ratio is not equal to one in this case, and a more 
general Metropolis-Hastings algorithm (with rejection of some 
proposed threadings) is required (see Text SI for details). In 
practice, the acceptance rates for proposed threadings are fairly 
high (~40% for typical human data), and despite this limitation, 
Metropolis-Hastings subtree threading considerably improves the 
mixing properties of the Gibbs sampler for moderately large values 
of n (see below). 

ARGweaver Program and Visualization 

We implemented these sampling strategies in a computer 
program called ARGweaver, that "weaves" together an ARG by 
repeated applications of the threading operation. The program has 
subroutines for threading of both individual sequences and 
subtrees. Options allow it to be run as a Gibbs sampler with 
single-sequence threading or a general Metropolis-Hastings 
sampler with subtree threading. In either case, sequential sampling 
is used to obtain an initial ARG. Options to the program specify 
the number of sampling iterations and the frequency with which 
samples are recorded. The program is written in a combination of 
C-H- and Python and is reasonably well optimized. For example, it 
requires about 1 second to sample a threading of a single 1 Mb 
sequence in an ARG of 20 sequences with 20 time steps. Our 
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source code is freely available via GitHub (https://github.com/ 
mdrasmus/ argweaver). 

To summarize and visualize samples from the posterior 
distribution over ARGs, we use two main strategies. First, we 
summarize the sampled ARGs in terms of the time to most recent 
common ancestor (TMRCA) and total branch length at each 
position along the genome. We also consider the estimated age of 
the derived alleles at polymorphic sites, which we obtain by 
mapping the mutation to a branch in the local tree and calculating 
the average time for that branch (see Methods). We compute 
posterior mean and 95% credible int("r\'als for each of these 
statistics per genomic position, and create genome browser tracks 
that allow these values to be visualized together with other 
genomic annotations. 

Second, we developed a novel visualization device for ARGs 
called a "leaf trace." A leaf trace contains a line for each haploid 
sequence in an analyzed data set. These lines are ordered 
according to the local genealogy at each position in the genome, 
and the spacing between adjacent lines is proportional to their 
TMRCAs (Figure S2). The lines are parallel in nonrecombining 
segments of the genome, and change in order or spacing where 
recombinations occur. As a result, several features of interest are 
immediately evident from a leaf trace. For example, recombina- 
tion hot spots show up as regions with dense clusters of vertical 
lines, whereas recombination cold spots are indicated by long 
blocks of parallel Unes. 

Simulation Study 

Effects of discretization and convergence of 

sampler. Before turning to inference, we performed a series 
of preliminary experiments to verify that our discretization strategy 
allowed for an adequate fit to the data and that ARGweaver 
converged to a plausible posterior distribution for realistic 
simulated data sets. Briefly, we found that the DSMC produces 
similar numbers of recombination counts and segregating sites as 
the coalescent-with-recombination and SMC, when generating 
data under various recombination rates and effective population 
sizes (see Text SI and Supplementary Figure SI). With small 
numbers of sequences, the Gibbs sampler based on the single- 
sequence threading operation appeared to converge rapidly, 
according to both the log likelihood of the sampled ARG and 
the inferred numbers of recombination events. When the number 
of sequences grew larger than about 6-8 (depending on the specific 
details of the simulation), the Gibbs sampling strategy was no 
longer adequate. However, the subtree threading operation and 
Metropolis-Hastings sampler appeared to address this problem 
effectively, allowing the number of sequences to be pushed to 20 or 
more. With 20 sequences 1 Mb in length, the sampler converges 
within about 500 sampling iterations, which takes about 20 min- 
utes on a topical desktop computer (Supplementary Figure S3). 

Recovery of global ARG features. Next, we systematically 
assessed the ability of ARGweaver to recover several features of 
interest from simulated ARGs over a range of plausible ratios of 
mutation to recombination rates (see Methods for simulation 
parameters). In these experiments, we considered three "global" 
features of the ARG: (i) the log joint probability of the ARG and 
the data (log of equation 2), (ii) the total number of recombina- 
tions, and (iii) the total branch length of the ARG. We define the 
total branch length of the ARG to be the sum of the total branch 
lengths of the local trees at all sites (in generations), a quantity 
proportional to the expected number of mutations in the history of 
the sample. We applied ARGweaver to each simulated data set with 
500 burn-in iterations, followed by 1000 sampling iterations, with 
every tenth sample retained (100 samples total). 



We found that ARGweaver was able to recover the features of 
interest with fairly high accuracy at all parameter settings 
(Figure 4A and Supplementary Figure S4). In addition, the 
variance of our estimates is generally fairly low, but does show a 
clear reduction as fi/ p increases from 1 to 6, corresponding to an 
increase in the phylogenetic information per nonrecombining 
segment. Most current estimates of average rates would place the 
true value otfi/p for human populations between 1 and 2 [55-57], 
but the concentration of recombination events in hot spots implies 
that the ratio should be considerably more favorable for our 
methods across most of the genome. Notably, we do obsc-rve a 
slight tendency to under-estimate the number of recombinations, 
particularly at low values of fl/ p. This underestimation is paired 
with an over-estimation of the joint probability (left column), 
suggesting that it reflects model misspecification of the DSMC. It is 
possible that this bias could be improved by the use of the SMC 
rather than the SMC, or by a finer-grained discretization scheme 
(see Discussion). 

Recovery of local ARG features. An advantage of explicidy 
sampling full ARGs is that it enables inferences about local 

features of the ARG that are not directly determined by model 
parameters. Using the same simulated data and inference 
procedure as in the previous section, we evaluated the perfor- 
mance of ARGweaver in estimating three representative quantities 
along the genome sequence: (i) time to most recent common 
ancestry (TMRCA), (ii) recombination rate, and (iii) allele age. We 
estimated each quantity using an approximate posterior expected 
\aluc, computc'd by averaging acToss sampled ARGs. With 20 
sequences, we found that ARGweaver was able to recover the 
TMRCA with fairly high accuracy and resolution (Figure 4B). The 
quality of the estimates degrades somewhat at lower values of the 
ratio fi/ p but remains quite good even with fi/p=l (Supplemen- 
tary Figure S5). We found that our power for recombination rates 
was weak with only 20 sequences, but with 100 sequences the 
reconstructed ARGs clearly displayed elevated rates of recombi- 
nation in simulated hotspots compared with the flanking regions 
(Supplementary Figure S6). Estimates of allele ages appeared to be 
unbiased, with good concordance between true and estimated 
values, although the variance in the estimates was fairly high 
(Supplementary Figure S7, left [:olumn). Notably, the ARG-based 
estimates of allele age appear to be considerably better than 
estimates based on allele frequency alone (Supplementary Figure 
S7, right column). Together, these results suggest that, even with 
modest numbers of sequences, the distributions of ARGs inferred 
by our methods may be informative about loci under natural 
selection, local recombination rates, and other local features of 
evolutionar)' history. 

Accuracy of local tree topologies. In our next experiment, 
we evaluated the accuracy of ARGweaver in inferring the topology 
of the local trees, again using the same simulated data. The local 
trees are a more complex f(;ature of the ARG but are of particular 
interest for a[)plications such as genotype imputation and 
association mapping. For comparison, we also inferred local trees 
using the heuristic Margarita program [34], which is, to our 
knowledge, the only other published ARG-inference method that 
can be applied at this scale. In addition, we applied an 
unpublished method, called tree.sim (http://niallcardin.com/ 
treesim/index.html), that samples genealogies using heuristic 
extensions of the Monte Carlo methods of Fearnhead and 
Donnelly [22]. To compare these programs, we identified 100 
evenly spaced locations in our simulated data sets, and extracted 
the local trees reconstructed by all three methods at these 
positions. We found that ARGweaver produced more accurate local 
tree topologies than both Margarita and treesim across most values of 
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Figure 4. Simulation results. (A) Recovery of global features of simulated ARGs from sequence data. This plot is based on sets of 20 1-IVIb 
sequences generated under our standard simulation parameters (see IVIethods) with /(/p = 2 (see Supplementary Figure 10 for additional results). 
From left to right are shown true (.T-axis) versus inferred (j'-axis) values of the log joint probability (the logarithm of equation 2), the total number of 
recombinations, and the total branch length of the ARG. Each data point in each plot represents one of 100 simulated data sets. In the vertical 
dimension, circles represent averages across 100 sampled ARGs based on the corresponding data sets, sampled at intervals of 10 after a burn-in of 
200 iterations, and error bars represent the interval between the 2.5 and 97.5 percentiles. In the second and third plots, circles are interpretable as 
posterior expected values and error bars as 95% Bayesian credible intervals. (B) Posterior mean TIVIRCA (dark red line, with 95% credible intervals in 
light red) versus true TMRCA (black line) along a simulated genomic segment of 1 Mb. This plot is based on a single representative data set of 20 1- 
Mb sequences generated under our standard simulation parameters with ji/ p = 6 (see Supplementary Figure S5 for additional results). 
doi:1 0.1 371 /journal.pgen.1 004342.g004 



/j/p, except for the case of i.i/p=\, where treesim performed 
shghtly better (Supplementary Figure S8). The improvements were 
most pronounced at high p values, where topological informa- 
tion is greatest. In addition, the absolute accuracy of the trees 
inferred by ARGweaver was fairly high, given the sparseness of 
informative sites in these data sets. For example, at jij p = 6, more 
than 80% of predicted branches were correct and Maximum 
Agreement Subtree (MAST) percentages approached 75%, and 
even in the challenging case of p./p=\, over 60% of branches 
were correct and MAST percentages exceeded 50%. These results 
indicate that the sampler is effectively pooling information from 
many sites across the multiple alignment in making inferences 
about local tree topologies. 

Finally, we evaluated the accuracy ot ARGweaver's assessment of 
the uncertainty about the local trees given the data. We grouped 
individual branches into bins according to their estimated 
posterior probabilities (i.e., the fraction of sampled local trees in 
which each branch is found), and compared these values with the 
relative frequencies with which the same branches were observed 
in the true trees. We found that the predicted and actual 
probabilities of correctness were closely correlated, indicating that 
ARGweaver is accurately measuring the uncertainty associated with 
the local trees (Supplementary Figure S9). By contrast, the 
heuristic Margarita sampler shows a clear tendency to overestimate 
the confidence associated with branches in the local trees, often by 
10-20%. This comparison is not entirely fair, because the authors 
of Margarita do not claim that it samples from the posterior 
distribution, but it nevertheless highlights an important advantage 



of the Bayesian approach. Notably, the unpublished treesim 
program performed remarkably well on this test. 

Analysis of Real Data 

Having demonstrated that ARGweaver was able to recover many 
features of simulated ARGs with reasonable accuracy, we turned 
to an analysis of real human genome sequences. For this analysis 
we chose to focus on sequences for 54 unrelated individuals from 
the "69 genomes" data set from Complete Genomics (http:// 
www.completegenomics.com/ pubhc-data/69-Genomes) [58] . 
The 54 genome sequences were computationally phased using 
SHAPEIT v2 [59] and were filtered in various ways to minimize 
the influence from alignment and genotype-calling errors. They 
were partitioned into ~2-Mb blocks -And ARGweaver wds applied to 
these blocks in parallel using the Extreme Science and Engineering 
Discovery Environment (XSEDE). For this analysis, we assumed 
^=19, 5jf = 1,000,000 generations, 4Af^i = 5.8 x 10-^ and 
p=\.26y. 10^^, implying A'^= 11,534. We allowed for variation 
across loci in mutation and recombination rates. For each ~2-Mb 
block, we collected samples for 2,000 iterations of the sampler and 
retained every tenth sample, after an appropriate burn-in (see 
Methods for complete details). The entire procedure took 
~36 hours for each of the 1,376 2-Mb blocks, or 5.7 CPU-years 
of total compute time. The sampled ARGs were summarized by 
UCSC Genome Browser tracks describing site-specific times to 
most recent common ancestry (TMRCA), total branch length, 
allele ages, leaf traces, and other features across the human 
genome. These tracks are publicly available from our local mirror 
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of the UCSC Genome Browser (http://genome-mirror.bscb. 
cornell.edu, assembly hgI9). 

Distortions in the ARG due to natural selection. While 

our prior distribution over ARGs is based on the neutral 
coalescent, we were interested in exploring whether natural 
selection produces a sufficiently strong signal in the data to create 
detectable distortions in the ARG near functional elements. We 
began by examining the estimated posterior expected values of the 
TMRCA around known protein-coding genes, focusing on 
fourfold degenerate (4d) sites within coding exons and noncoding 
sites flanking exons. For comparison with our ARG-based 
measures, we also computed a simple measure of nucleotide 
diversity, n. Both n and the ARG-based TMRCA behave in a 
qualitatively similar manner near genes, achieving minimal values 
in coding exons and gradually increasing with distance from exon 
boundaries (Figure oA). These observations are consistent with 
several recent studies indicating reduced neutral diversity near 
both coding and noncoding functional (Jcmc-nts, which has been 
attributed to indirect effects from selection at linked sites [60-64] . 
However, it has been difficult to distinguish between two 
alternative modes of selection both predicted to have similar 
influences on patterns of neutral diversity: "background selection" 
(BGS) associated with negative or purifying selection at linked sites 
[65-68], and "hitchhiking" (HH) (selective sweeps) associated with 
linked mutations under positive selection [69] . In principle, explicit 
ARG inference could help to resolve this controversy, because 
BGS and HH lead to difiTerent predictions for the structure of 
genealogies (e.g., [70,71]). 

To examine these questions further, we computed the same 
statistics for 255 putative partial selective sweeps identified in CEU 
populations and 271 partial sweeps identified in YRI populations 
based on the integrated extended haplotype homozygosity statistic 
(iHS) [72]. As expected, the sweep regions were broadly similar to 
the protein-coding genes in terms of nucleotide diversity n 
(Figure 5B). However, unlike the protein-coding genes, the sweep 
regions displayed no clear depression in TMRCA. One possible 
way of understanding this observation is that, while sweeps tend to 
be enriched overall for recent coalescence events (as indicated by 
the reductions in n), the oldest coalescence events are relatively 
unaffected by selective sweeps, perhaps because some lineages tend 
to "escape" each sweep, leading to near-neutral patterns of 
coalescence near the roots of genealogies (where the contribution 
to the TMRCA is greatest). This may be particularly true for the 
partial sweeps identified by the iHS method, but a similar 
phenomenon should occur in flanking regions of the causal 
mutations for complete sweeps. BGS, by contrast, is expected to 
affect both the total branch length and TMRCA approximately 
equally, by effectively reducing the time scale of the coalescence 
process, but to have a minimal influence on the relative intervals 
between coalescence events. 

In an att('mpt to distinguish further between BGS and HH, we 
introduced a statistic called the relative TMRCA halflife (RTH), 
defined as the ratio between the time to most recent common 
ancestry for the first 50% of chromosomes and the fuU TMRCA. 
The RTH captures the degree to which coalescence events are 
skewed toward the recent past, in a manner that does not depend 
on the overall rate of coalescence. Thus, the RTH should be 
relatively insensitive to BGS, but sensitive to HH if, as proposed 
above, sweeps tend to affect many but not all lineages (see 
Supplementary Figure SIO). In the European populations, the 
statistic showed a pronounced vaUey near selective sweeps 
(Figure 5B), as expected, but it was much more constant across 
genie regions (Figure 5A). Its behavior was similar in the African 
populations, except that it showed somewhat more variability near 



genes, yet in an opposite pattern from the sweeps (Supplementary 
Figure SI 1). Overall, these results suggest that, while the total rate 
of coalescence differs substantially across genie regions, the relative 
depths of middle and extreme coalescence events do not, on 
average, consistent with the predictions of a model in which BGS 
dominates in genes [60,62,64]. The sharply contrasting patterns 
for the iHS-identified sweeps suggest that partial sweeps of this 
kind make at most a minor contribution to the reduced diversity 
near protein-coding exons. Nevertheless, these observations do not 
rule out the possibility that alternative modes of hitchhiking for 
which iHS has low power — such as recurrent hard or soft 
sweeps — might make a non-negligible contribution to patterns of 
variation near human protein-coding genes (see Discussion). 

Genomic regions with extremely ancient most recent 
common ancestry. The previous section showed that genomic 
regions with reduced TMRCAs are often associated with purifying 
selection. To see whether the opposite signal was also of interest, 
we computed the posterior expected TMRCA in 10-kb blocks 
across the human genome and examined the regions displaying 
the oldest shared ancestry. Not surprisingly, four of the top twenty 
10-kb blocks by TMRCA fall in the human leukocyte antigen 
(HLA) region on chromosome 6 (see Table 2). It has been known 
for decades that the HLA region exhibits extraordinary levels of 
genetic diversity, which is believed to be maintained by some type 
of balancing selection (overdominance or frequency-dependent 
selection) associated with the immunity-related functions of the 
HLA system [73-75] . The four HLA-related high-TMRCA blocks 
include three regions near HLA-F and one region between HLA-A 
and HLA-J (Supplementary Figure SI 2). All four high-TMRCA 
regions exhibit more than 12 polymorphisms per kilobase of 
unfiltered sequence, 8-10 times the expected neutral rate after 
normalizing for local mutation rates (as detailed in Table 2; see 
also Supplementary Figure SI 3). The estimated TMRCAs for 
these regions range from ~ 340,000-380,000 generations, or 
~8.5-9.5 My (assuming 25-year generations). 

Among these high-TMRCA blocks were two additional regions 
that displayed extraordinary levels of mutation-rate-normalized 
nucleotide diversity. The first of these, in a gene desert near the 
telomere of the long arm of chromosome 4, exhibits the deepest 
expected TMRCA in the genome, at >600,000 generations 
(15 My), and has >30 times the neutral polymorphism rate 
(Table 2). The second region is the PRIM2 gene on chromosome 
6, which contributes the 4th and 7th highest TMRCA blocks in 
the genome, exhibiting polymorphism rates 28.0 and 12.8 times 
the neutral expectation, respectively. Both of these regions were 
identified as extreme outliers in a recent study of coincident SNPs 
in humans and chimpanzees, and it was argued that the PRIM2 
gene was a likely target of balancing selection [76]. On closer 
inspection, however, we found that both regions were flagged by 
Complete Genomics as having "hypervariable" or "invariant" 
read depth across individuals, suggesting that the elevated SNP 
rates in our data are likely artifacts of copy number variation 
(CNV) at loci unduplicated in the reference genome. (Leffler et al. 
recently reached a similar conclusion about PRIM2 [77].) Despite 
that these flags were associated with only ~5% of genomic 
positions, they indicated that five of our top six regions were likely 
CNVs (Table 2). Thus, for all subsecjuent analyses reported in this 
paper and for our publicly available browser tracks, we filtered out 
all regions labeled as invariant or hypervariable. 

Once these extreme outiiers were excluded, several loci of 
interest remained. In addition to the four HLA loci, these included 
(7^5 in Table 2) an apparent cis-regulatory region downstream of 
the KCNE4 gene, which encodes a potassium voltage-gated 
channel (Supplementary Figure SI 4); (#9) an intronic interval in 
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Figure 5. Measures of genetic variation near protein-coding genes and partial selective sweeps. Shown (from top to bottom) are 
nucleotide diversity (k), time to most recent common ancestry (TIVIRCA), and relative TIVIRCA halflife (RTH) for the 1 3 individuals (26 haploid genomes) 
of European descent (CEU and TSI populations) in the Complete Genomics data set (similar plots for African population are shown in Supplementary 
Figure S1 1). Nucleotide diversity ji was computed as the average rate of nucleotide differences per site across all pairs of chromosomes, whereas 
sitewise values of the TMRCA and RTH were computed by averaging over local trees sampled by ARGweaver. (A) Estimates for 1 7,845 protein-coding 
genes from the Consensus Coding Sequence (CCDS) track in the UCSC Genome Browser (hg19). Estimates for noncoding regions were computed by 
averaging in a sliding window of size 300 bp then averaging across genes. Estimates for coding exons were computed by first averaging over 
fourfold degenerate (4d) sites of each exonic type (first, middle, last), then averaging across genes (see Methods). Only 4d sites were considered to 
focus on the influence of selection from linked sites rather than direct selection. Nevertheless, the decreased values for the exons suggest some 
influence from direct selection. The differences between exons and flanking sites may also be influenced by windowing in the noncoding regions. 
"First exon" is taken to begin at the annotated start codon and "last exon" to end at the stop codon, so that both exclude untranslated regions. The 
TMRCA is measured in thousands of generations. RTH is ratio of the time required for the first 50% of lineages to find a most recent common ancestor 
to the full TMRCA (see Supplementary Figure S10). Error bars (dashed lines for noncoding regions) indicate 95% confidence intervals as estimated by 
bootstrapping over regions. (B) Similar plots for 255 100-kb regions predicted to have undergone partial selective sweeps in the CEU population 
based on the iHS statistic [72]. In this case, all measures are computed in a sliding window of 10,000 bases. Notice that both protein-coding genes 
and putative selective sweeps display substantial reductions in nucleotide diversity, but the genes show a much more prominent reduction in 
TMRCA, whereas the sweeps show a much more prominent reduction in RTH. These signatures are consistent with a dominant influence from 
background selection rather than hitchhiking in protein-coding genes (see text). 
doi:1 0.1 371 /journal.pgen.1 004342.g005 



BCAR3, a gene involved in the development of anti-estrogen 
resistance in breast cancer (Supplementary Figure SI 5); (#16) an 
apparent regulatory region upstream of TULP4, a tubby-like 
protein that may be involved in ubiquitination and proteasomal 



degradation with a possible association with cleft lip (Supplemen- 
tary Figure S16); and (#18) an intronic region in CRHRl, which 
encodes a GPCR that binds corticotropin releasing hormones, has 
roles in in stress, reproduction, immunity, and obesity, and is 
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associated with alcohol abuse, asthma, and depression. Notably, all 
of these are predominantly noncoding regions that include 
multiple ChlP-seq-supported transcription factor binding sites. 
The estimated TMRCAs of these regions range from 335,000- 
450,000 generations (8.4— 1 1.3 My), suggesting genetic variation in 
these loci considerably predates the human/chimpanzee diver- 
gence. 

Segregating haplotypes shared between humans and 
chimpanzees. To explore the connection between extreme 
TMRCAs and balancing selection further, we examined 125 loci 
recently identified as having segregating haplotvpes that are shared 
between humans and chimpanzees [77]. These loci are expected 
to be enriched for ancient polymorphisms maintained by 
balancing selection, although some may reflect independent 
occurrences of the same mutation in both species. We compared 
these putative balancing selection loci with neutral sequences 
having the same length distribution (see Methods), and found that 
their ARGweaver-estimated TMRCAs were clearly shifted toward 
higher values, with a mean value nearly twice as large as that of 
the neutral sequences (Supplementary Figure SI 7). In addition, 
the putati\'e balancing sflcction loci that do not contain 
polymorphisms in CpG dinuck'otides — which are less likely to 
have experienced parallel mutations — had slightly higher 
TMRCAs than the group as a whole. 

If these loci are sorted by their estimated TMRCAs, several loci 
that were highlighted by Leffler et al. [77] for having more than 
two pairs of shared SNPs in high LD appear near the top of the list 
(Table 3). For example, the haplotype between the FREA43 and 
GTPE genes (#11 in Table 3; Supplementary- Figure SI 8) contains 
shared SNPs in almost perfect LD with several expression 
quantitative trait loci (cQTLs) for GTPE, a close paralog of a 
gene [GTPA) that encodes a receptor for Plasmodium falciparum and 
may be under balancing selection. Another haplotype (^^3) 
contains shared SNPs in significant LD with an eQTL for MTRR, 
a gene implicated in the regulation of folate mc-tabolism, including 
one SNP that is also segregating in gorillas. In a third case (#18), 
the shared SNPs occur in a likely enhancer in an intron oiIGFBP7, 
a gene that plays a role in innate immunity, among other 
functions. Another example is a locus near the ST3GAL1 gene (#7) 
that contains only one pair of shared SNPs but was suggested by a 
phylogenetic analysis to have an ancient origin [7 7] . Notably, all of 
these shared haplotypes fall outside of coding regions and several 
show signs of regulatory activity based on functional genomic data 
[77]. Their expected TMRCAs range from roughly 150,000 to 
250,000 generations, or 3.8-6.3 My. Thus, the ARGweaver 
estimates of age are reasonably consistent with the hypothesis 
that these hapolotypes predate the human/ chimpanzee divergence 
(estimated at 3.7-6.6 Mya [57]), an observation that is especially 
notable given that our analysis does not make direct use of data 
from chimpanz(X"s. 

By contrast, the loci n('ar the bottom of the list (with the shortest 
TMRCAs) appear to be much less convincing. For example, the 
bottom 20 have expected ages of only 25,000-50,000 generations 
(0.65-1.3 My), suggesting that they actually post-date the human/ 
chimpanzee divergence by millions of years. In addition, many of 
these regions appear hundreds of kUobases from the nearest gene, 
and they typically do not overlap regions with strong functional or 
comparative genomic evidence of regulatory potential. Indeed, if 
our ARG-based estimates of the TMRCA are interpreted literally, 
a majority of the 125 segregating haplotypes may post-date the 
human/chimpanzee divergence, which current estimates would 
place at ^150,000 generations ago (see Supplementary Figure 
SI 7). This observation is in general agreement with rough 
calculations by Leffler et al. suggesting that the false discovery 



rate for ancient balancing selection in this set could be as high as 
75% [77]. Thus, it appears that our ARG-based methods may be 
useful in distinguishing true ancestral polymorphisms from shared 
haplotypes that occur by chance due to homoplasy. 

Natural selection and allele age. Next we examined the 
ARG-based expected ages of derived alleles at polymorphic sites in 
various annotation classes. Classical theory predicts that both 
deleterious and advantageous alleles will not only have skewed 
population frequencies but will also tend to be younger than 
neutral alleles at the same frequency, because directional selection 
will tend to acc:elerate a new mutation's path to fixation or loss 
[78]. This idea has recendy been used to characterize selection in 
the human genome based on a haplotype-based summary statistic 
that serves as a proxy for allele age [79]. We computed ARG- 
based estimates of allele age in putatively neutral regions (Neut), 
fourfold degenerate sites in coding regions (4d), conserved 
noncoding sequences (CNS), missense coding mutations predicted 
by PolyPhen-2 to be "benign" (PPhiBenign), "possibly damaging" 
(PPhiPosDam), or "probably damaging" (PPh:ProbDam), and 
coding or noncoding mutations classified by the ClinVar database 
(http://www.ncbi.nlm.nih.gov/clinvar) as "nonpathogenic" (cate- 
gories 1-3; CViNonPath) or "pathogenic" (categories 4 & 5; 
CViPath) based on direct supporting evidence of phenotypic 
effects. We found, indeed, that the Neut mutations were 
significantly older, on average, than all other classes (Figure 6A). 
In addition, among the missense coding mutations, PPhiBenign 
mutations were the oldest, PPh:PosDam were significandy 
younger, and PPh:ProbDam mutations were the youngest. 
Similarly, mutations in the CV:NonPath class were significandy 
older than those in the CV:Path class. Interestingly, the 4d 
mutations showed substantially lower average ages (by >30%) 
than the Neut mutations. We attribute this reduction primarily to 
the effects of selection from linked sites (see [60]), although direct 
selection from mRNA secondary structure and exonic regulatory 
elements may also contribute to it. 

In part, these differences in age simply reflect differences in the 
site frequency spectrum (SFS) across classes of mutations. For 
example, missense mutations are well known to be enriched for 
low-frequency derived alleles, which will tend to be younger, on 
average, than higher-frequency derived alleles. To account for the 
influence of allele frequency, we further grouped the sites in each 
annotation class by derived allele frequency and compared the 
average allele ages within each group (Figure 6B). As expected, the 
estimated ages increase with the derived allele frequency across all 
annotation classes. In addition, within each class we continue to 
observe approximately the expected rank-order in allele ages, with 
Neutral mutations being the oldest, 4d, PPh:Benign, CNS, and 
CV:NonPath mutations coming next, followed by PPhiPosDam, 
PPhiProbDam, and CV:Path mutations. This analysis demon- 
strates that ARGweaver is able to obtain information about natural 
selection from allele ages beyond what can be obtained from the 
SFS alone. 

Another way of viewing these results is to consider the reduction 
in allele age relative to the neutral expectation within each 
frequency group, across annotation classes (Supplementary Figure 
SI 9). As expected, these reductions are larger at higher allele 
frequencies, where sojourn times will tend to be longer. However, 
from this representation it is also clear that the reductions in age 
increase with frerjuency much more rapidly for the mutations 
under strong, direct selection than for the mutations at which 
selection from linked sites is expected to dominate. For example, at 
very low derived allele frequencies (singletons), the reduction in 
age of 4d mutations is roughly equal to that at PPh:PosDam 
mutations, whereas at higher derived allele frequencies the 
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Figure 6. Mean allele age as a function of annotation class and derived allele frequency. (A) Estimated age of derived allele in 
generations, averaged across polymorphic sites of various annotation classes. Estimates were derived from ARGs sampled by ARGweaver based on 
the Complete Genomics data set (see IVlethods). Error bars represent one standard deviation above and belov*/ the mean. Neut = putatively neutral 
sites; 4d = fourfold degenerate sites in coding regions; CNS = conserved noncoding sequences identified by phastCons; PPh:{Benign,PosDam,Prob- 
Dam} = missense mutations identified by PolyPhen-2 as "benign", "possibly damaging", or "probably damaging", respectively; CV:{NonPath,Path} = 
mutations in "nonpathogenic" (categories 1-3) or "pathogenic" (categories 4 & 5) classes in the ClinVar database, respectively. (B) Similar plot with 
categories further divided by derived allele frequencies (DAF) in numbers of chromosomes out of 108. Error bars represent 95% confidence intervals, 
as assessed by bootstrapping. In categories that combine multiple frequencies (e.g., 4-5, 6-8), a subsampling strategy was used to ensure that the 
relative contributions of the different frequencies matched those of the Neut class. Estimates for DAF>20 were excluded due to sparse data. Notice 
that ages generally increase with DAF, as expected (see Supplementary Figure S7), but at a considerably reduced rate in categories under strong 
selection. 

doi:10.1371/journal.pgen.1004342.g006 



damaging mutations exhibit reductions in age 2-3 times larger. 
The reason for this observation is probably that the reduction in 
age for the nearly neutral sites is largely a consequence of 
reductions in the local effective population size due to selection at 
linked sites, while the reductions at sites under direct selection are 
driven by the influence of selection on sojourn times (see 
Supplementary Figure S19 for a detailed discussion). Consistent 
with this interpretation, CNS mutations show less reduction in age 
than 4d and PPhiBenign mutations at low frequencies, and more 
reduction at high frequencies, suggesting that CNS mutations are 
influenced less by selection at linked sites and more by direct 
selection. 

Discussion 

Several decades have passed since investigators first worked out 
the general statistical characteristics of population samples of 
genetic markers in the presence of recombination [21,80-83]. 
Nevertheless, solutions to the problem of explicitly characterizing 
this structure in the general case of multiple markers and multiple 
sequences — that is, of making direct inferences about the ancestral 
recombination graph (ARG) [19,20] — have been elusive. Recent 
investigations have led to important progress on this problem 
based on the Sequentially Markov Coalescent (SMC) [17,37-42], 
but existing methods are still either restricted to small numbers of 
sequences or require severe approximations. In this paper, we 



introduce a method that is faithful to the SMC yet has much better 
scaling properties than previous methods. These properties 
depend on a novel "threading" operation that can be performed 
in a highly efficient manner using hidden Markov modeling 
techniques. Inference does require the use of Markov chain Monte 
Carlo (MCMC) sampling, which has certain costs, but we have 
shown that the sampler mixes fairly well and converges rapidly, 
particularly if the threading operation is generalized from single 
sequences to subtrees. Our methods allow explicit statistical 
inference of ARGs on the scale of complete mammalian genomes 
for the first time. Furthermore, the sampling of ARGs from their 
posterior distribution has the important advantage of allowing 
estimation of any ARG-derived quantity, such as times to most 
recent common ancestry, allele ages, or regions of identity by 
descent. 

Despite our different starting point, our methods are similar in 
several respects to the conditional sampling distribution (CSD)- 
based methods of Song and colleagues [49-52]. Both approaches 
consider a conditional distribution for the «th sequence given the 
previous n—l sequences, and in both cases a discretized SMC is 
exploited for efficiency of inference. However, the CSD-based 
methods consider the marginal distribution of the nth sequence 
only given the other n — \ sequences and never explicitiy 
reconstruct an ARG, while ours considers the joint distribution 
of an ARG of size n and the «th sequence, given an ARG of size 
« — 1 and the previous n—l sequences. In a sense, we have 
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employed a "data augmentation" strategy by explicitly represent- 
ing full ARGs in our inference procedure. The main cost of this 
strategy is that it requires Markov chain Monte Carlo methods for 
inference, rather than allowing direct likelihood calculations and 
maximum-likelihood parameter estimation. The main benefit is 
that it provides an approximate posterior distribution over 
complete ARGs and many derived quantities, including times to 
most recent common ancestry, allele ages, and distributions of 
coalescence times. By contrast, the CSD-based methods provide 
information about only those properties of the ARG that are 
directly described by tlu- model parameters. We view these two 
approaches as complementary and expect that they wiU have 
somewhat different strengths and weaknesses, depending on the 
application in question. 

Our explicit characterization of genealogies can be exploited to 
characterize the influence of natural selection across the genome, 
as shown in our analysis of the Complete Genomics data set. In 
particular, we see clear evidence of an enrichment for ancient 
TMRCAs in regions of known and predicted balancing selection, 
reduced TMRCAs near protein-coding genes and selective 
sweeps, and reduced allele ages in sites experiencing both direct 
selection and selection at closely linked sites. Interestingly, the 
genealogical view apjjcars to have the potential to shed light on the 
difficult problem of distinguishing between background selection 
and hitchhiking. Our initial attempt at addressing this problem 
relies on a genealog^'-based summary statistics, the relative 
TMRCA halflife (RTH), that does appear to distinguish effectively 
bctwcc'n protein-coding genes and partial sel(;cti\c' sweeps 
identified by iHS. However, more work wiU be needed to 
determine how well this approach generalizes to other types of 
hitchhiking (e.g., complete sweeps, soft sweeps, recurrent sweeps) 
and whether additional genealogical information can be used to 
characterize the mode of selection more precisely. Additional work 
is also needed to determine whether our ARG-based allele-age 
estimator — which is highly informati\-c in bulk statistical compar- 
isons but has high variance at individual sites — can be used to 
improve functional and evolutionary characterizations of partic- 
ular genomic loci. A related challenge is to see whether our 
genome-wide ARG samples can be used to improve methods for 
association/LD mapping (see [34,84-88]). 

In addition to natural selection, our methods for ARG inference 
have the potential to shed light on historical demographic 
processes, an area of particular interest in the recent literature 
[16,17,51,52,89]. To explore the usefulness of ARGweaver in 
demography inference, we attempted to infer a population 
phylogeny with admixture edges for the 1 1 human populations 
represented in the Complete Genomics data set, based on the 
genealogies sampled under our naive (panmictic) prior distribu- 
tion. We extracted 2,304 widely spaced loci from our inferred 
ARGs, obtained a consensus tree at each locus, and reduced this 
tree to a subtree with one randomly selected chromosome for each 
of the 1 1 populations (see Text S 1 for details). We then analyzed 
these 1 1 -leaf trees with the PhyloNet program [90,91], which finds 
a population tree that minimizes the number of "deep coales- 
cences" required for reconciliation with a given set of local trees, 
allowing for both incomplete lineage sorting and hybridization 
(admixture) events between groups. PhyloNet recovered the 
expected phylogeny for these populations in the absence of 
hybridization and generally detected complex patterns of gene 
flow where they are believed to have occurred, but it had diSiculty 
reconstructing the precise relatinonships among source and 
admixed populations (Supplementary Figure S20). These exper- 
iments suggested that the posterior distribution of ARGs does 
appear to contain useful information about population structure 



even when a noninformative prior distribution is used, but that 
additional work will be needed to fuUy exploit the use of ARG 
inference in demographic analysis. 

An alternative strategy would be to extend our methods to 
incorporate a full phylogenetic demographic model, such as the 
one used by G-PhoCS [92], thereby generalizing this fully 
Bayesian method to a setting in which recombination is allowed 
and complete genome sequences are considered. Importantly, the 
use of the complete ARG would allow information about 
demographic history from both patterns of mutation and patterns 
of linkage disequilibrium to be naturally integrated (see [92]). 
However, as with CSD-based methods [51,52], an extension to a 
full, parametric multi-population model for application on a 
genome-wide scale would be technically challenging. In our case, 
it would require the ability to sample "threadings" consistent with 
the constraints of a population model (e.g., with no coalescent 
events between genetically isolated populations) and exploration 
of a full collection of population parameters, which would likely 
lead to slow convergence and long running times. Nevertheless, a 
version of this joint inference strategy may be feasible with 
appropriate heuristics and approximations. Our methods may 
also be useful for a wide variety of related applications, including 
local ancestr)' inference [47,93,94], haplotype phasing/ g(;n<)t)'pe 
imputation [46,48,95,96], and recombination rate estimation 
[22,97]. 

Our initial implementation of ARGweaver reUes on several 
simplying assumptions that appear to have minimal impact on 
performance with (real or simulated) human serjuence data, but 
may produce limitations in other settings. Following Li and 
Durbin [42] , we compute probabilities of recombination between 
discrete genomic positions under the assumptions of the contin- 
uous-space SMC [37]. When recombination rates are low, the 
discrete and continuous models are nearly identical, but the 
differences between them can become significant when recombi- 
nation rates are higher [98] . Similarly, our assumption of at most 
one recombination event per site and our use of the SMC rather 
than the improved SMC' [38] may lead to biases in cases of higher 
recombination rates, larger numbers of sequences, or more 
divergent sequences. In addition, our heuristic approach of 
accommodating zero-length branches by randomly sampling 
among "active" branches for coalescence and recombination 
events (see Methods) may lead to biases when the discretization 
scheme is coarse relative to evolutionary events of interest. Finally, 
we currentiy assume haploid genome sequences as input, which, in 
most cases of current interest, requires computational phasing as a 
pre-processing step. Phasing errors may lead to over-estimation of 
recombination and mutation rates and associated biases, because 
the sampler wiU tend to compensate for them with additional 
recombination and/or mutation events. In principle, most of these 
limitations can be addressed within our frame^\'ork. For example, 
it should be fairly straightforward to extend ARGweaver to use the 
SMC and Hoboltli and Jensen's finite-loci transition density. In 
addition, we believe it is possible to enable the program to work 
directly with unphased data and integrate over all possible 
phasings (see, e.g., [92,99]). 

The ability to perform explicit ARG inference on the scale of 
complete genomes opens up a wide range of possible applications, 
but the long running times required for these analyses and the 
unwieldy data structures they produce (numerous samples of 
ARGs) are potential barriers to practical usefulness. In our initial 
work, we have attempted to address this problem by precomputing 
ARGs for a highly informative public data set and releasing both 
our complete ARGs and various summary statistics as browser 
tracks for use by other groups. We have also developed a simple 
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web interface that allows users to retrieve local trees and several 
useful summary statistics for specified genomic intervals, popula- 
tions, and individuals (http://compgen.bscb.comell.edu/ 

ARGweaver/CG_results). In future work, it may be possible to 
improve data acces.s by providing more sophisticated tools for data 
retrieval and visualization. For example, sampled ARGs could be 
stored in a database in a manner that allowed researchers to 
efficiendy extract features such as regions of IBD or recombination 
maps for designated subsets of samples. A related possibility would 
be to support on-the-fly threading of user-specified query 
sequences into precomputed ARGs. This operation would be 
analogous to local ancestry inference [47,93,94], but would reveal 
not only the population sources of query sequence segments, but 
also additional information about recombination events, coales- 
cence times, approximate mutation ages, and other features. The 
same operation could be used to allow our sampling methods to 
scale to thousands of genomes: one could infer ARGs for, say, 100 
genomes, then simply thread in hundreds more, without fuU 
MCMC sampling. In general, we believe that posterior samples of 
ARGs win be a rich resource for genetic analysis, but additional 
work is needed on data storage and query interfaces for these 
samples to become practically useful to large numbers of genomic 
researchers. 

Methods 

Discretized Sequentially Markov Coalescent 

Discretization scheme and notation. The Discretized 
Sequentially Markov Coalescent (DSMC) assumes that all 
coalescence and recombination events occur zt K+l discrete 

time points, V = {.'iQ, S],S2,...,Sx}, with .Vo = 0 (the present time) 
and Sk equal to a user-specified maximum value. These time 
points are defined in units of generations before the present time. 
We evenly distribute these time points on a logarithmic scale, so 
that the discretization scheme has finer resolution near the leaves 
of the ARG, where more events are expected to occur. 
Specifically, we define Sj (for 0<j<K) to be Sj=g(J), where 



denote the lengths of the half intervals between J ^ 2 '^'^'^ ^^'^ 



g(j)-- 



1 



exp 



K 



log(l +dSK) 



(4) 



Here, is the maximum time and 5 is a tuning parameter that, 
when increased, causes the time points to become more densely 
clustered near the leaves of the ARG. Notice that ^(0) = 0 and 
g(IC) = SK- In this work, we have assumed ijf = 1,000,000 
generations and 3 = OA. We denote the length of time interval j 
as Asi=Sj-f-i—Sj. The DSMC process is defined such that it 
approaches the continuous SMC as a limit as K^'J^ and each 
Ai,->0, with Sk sufficientiy large that the probability of a 
coalescence event older than Sk is close to zero. 

It is useful to specify "midpoints" between time points (on a log 
scale), to facilitate rounding of continuous-valued times to the 
nearest discrete time point. We define the midpoint between times 

Sj and Sj+i (for 0<j<K) as Sj^^ = g(j+ -). We can alternatively 
refer to the midpoint between times Sj-i and Sj as Sj_^=g(j — -) 



1, 



1, 



(for \<j<K), noting that Sj_x^=g{j- -) = g{{j-\)+ -) = 
Coalescence events that occur between Sj_^ and Sy^i 
are "rounded" to time point Sj. We found that it was less critical to 
round recombination events to the nearest time point, so they are 
simply rounded to the next most recent time point (see below). We 



between j and j + -, as A.y 



, and A.?; , , 1 



respectively. 



2- I- 

Because all coalescence events must occur at the designated 
time points, the collection of branches is fixed for each interval / 
between time points Sj and Sj+\. Given a local tree T" that is 
consistent with the DSMC, we denote the set of branches in time 
interval j as B{T",j). The size of this set, \B(T" ,j)\, is of particular 
interest, and is abbreviated Bj (with T" clear from context). In 
addition, it is often of interest to consider the branch sets for a tree 
T" from which a branch w has been removed. We denote such a 

tree by Tf'^ ''^ and abbreviate the number of branches in interval 
j as -B^ "'^ (again, with T" clear from context). 

One consequence of discretizing time is that the DSMC will 
tend to generate ARGs that contain many branches of length zero 
(corresponding to polytomies in the local trees), which will have 
zero probability of recombination, coalesce, or mutation events. In 
effect, the rounding procedure will tend to shrink short branches to 
zero, which may lead to distortions in data generation and 
inference. We address this problem heuristically, by defining the 
DSMC to first sample the times of recombination and coalescence 
events, and then randomly select a branch from all of those 
"active" at the sampled time point. We define the set of active 
branches at a time point Sj, for a local tree T", to be those 
branches in T" that start, end, or pass through Sj. This set is 
denoted A(Ti,j) and its size is abbreviated as Aj. As above, we use 
Aj "'^ to indicate the active branches at Sj excluding branch w. 
Simulations indicate that this heuristic solution to the problem of 
zero-length branches works fairly well in practice (see Figure SI). 

Recombination process. As in the standard SMC, recom- 
binations are assumed to occur according to a Poisson process with 
rate p|T'"_[|, where |77_[| is the total branch length of local tree 
T"_ J and p is the average number of recombinations/generation/ 
site. Once a recombination occurs, the ordinary SMC process 
places the recombination uniformly along the branches of T" ^. 
The, analogous operation of sampling a recombination branch and 
time point, R" = (w, Sk), in the DSMC is accompUshed by first 
sampling a time point S/^ in proportion to the total branch length 
present during time interval k, then randomly selecting one of the 
A/^ branches active at that time point. Consistent with the 
assumptions of the SMC, the recombination point cannot occur 
above the time point associated with the root r of tree T"_^, which 
we denote Sr. Thus, the sampling distribution for a recombination 
point R" on a local tree 7'"_j is given by, 

(5) 

exp(-p|7';_,l) if-R? = 0 

■:^-[l-exp(-pj7'«_,|)] if R'; = (n;sk), weA(T!'_,,k)fl<si,<Sr 

■^■[l-exp(-p|r«_i|)] if «« = (H.,«i), weA(T;-_,,r)\{r},s,=s, 
0 otherwise, 



where C= X]/=o ^j^^i ^ constant that explicitiy normalizes the 
distribution over time points Sq, . . . , s,. The special case for the 
time point at the root of the tree, Sk=Sr, is required because the 
SMC does not allow recombinations to occur beyond this point, so 
the effective number of active branches is only two at this time 
point, despite that A,, will have a value of three. The number of 
branches in the interval above the root, B^, is necessarily one, so 
this term can be omitted in this case. 
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This sampling distribution efiectively rounds the times of 
recombination events downward to the next most recent time 
point. However, a strict poKcy of downward rounding, together 
with a prohibition again recombination events above the root 
node, would make it impossible to sample recombination events at 
time point which turns out to have undesirable effects in 
inference (it makes some trees unreachable by the threading 
operation). Therefore, when sampling time points, we use the 
heuristic approach of imagining that recombinations can also 
occur in the time interval immediately above the root and 
assigning these events to the time point s^. This has the effect of 
redistributing some of the probability mass from later time points 
to the root, without altering the overall rate at which recombi- 
nations occur (fi\T"_^\). For this reason, the normalizing constant 
C differs slightly from the total branch length | T"_ j |; in particular, 
C = \T"_^\-^- Asr. It would be slightly more elegant to aUow 
upward as well as downward rounding of times for recombina- 
tions, as we do with coalescence events (see below), but as long as 
the time discretization is not too coarse these differences are of 
minor importance, and the approach we have used seems to be 
adequate. 

Re-coalescence process. Once a recombination point 
R" = (w, Sk) is sampled, the selected branch w is removed from 
time points s/^ and older, and sdlowed to re-coalesce to the 
remainder of the tree, in a manner analogous to the SMC. 
Because we explicitly prohibit multiple recombinations between 
adjacent positions, the local tree T" must be reachable from T" ^ 
by a single "subtree pruning and regrafting" (SPR) operation 
corresponding to the recombination, i.e., an operation that cuts a 
branch of the tree at the recombination point and re-attaches it 
(and any descendant nodes) to the remainder of the tree. Thus, we 
can write, 

(6) 



P(rf|j?^,rf_,,0) 

x^j\w^k,'I7_i,@) if J?? = (h', Sk), (x, sj)s.t. Tf = SPR(T;'_^,w,Sk^^j), sj>Sk 
0 otherwise, 



where SPR{Tf_^, w, s^, x, Sj) is a function that returns the new 
tree produced by an SPR operation on T"_ j that cuts branch w at 
time .?i and re-attaches it to branch x at time and 
P(x,Sj\w, Sk, T"_i, 0) is a joint conditional distribution over re- 
coalescence branches and time points. 

The main challenge is therefore to define the discrete re- 
coalescence distribution, P(x,Sj\w, s^, T" ^, &), for Si>Sk (as 
required by the SMC). There are two distinct cases to consider: 
Sj>Sk and Sj = Sk. When Sj>Sk, the unattached branch w must 
first fail to re-coalesce during the interval between Sk and and 
then must re-coalesce between and , i (because all such re- 
coalescence events will be rounded to Sj). By contrast, when Sj = Sk, 
the branch w must simply re-coalesce between sj { = Sk) and Sy^i 

(because the re-coalescence time is strictiy bounded by the 
recombination time). 

In all cases, the instantaneous rate of re-coalescence in each 
interval / {k<l<j) is given by B\'''^/(2Ni), in the standard 
manner for the coalescent. (Note that we use -Sj rather than _S/, 
because we are concerned with the coalescence rate to the 
remainder of the tree, excluding branch w. We also assume a 
diploid species throughout, so the total number of chromosomes 
per locus is 2N.) The probability that a lineage starting at a time 
coalesces before si+i is given by the cumulative distribution 
function for exponentially distributed waiting times, 



Wil, /-hl) = l-exp - 



2Ni 



(V) 



and the probability of coalescence during a sequences of intervals, 
m, m+l, . . . , n—l is given by, 



W{m, n) = l — exp — ^ ' 



2Ni 



(8) 



Similarly, the probabilities of coalescence during the half intervals 
before and after time point si are given, respectively, by, 



2N,-i , 



(9) 



Thus, the distribution of re-coalescence times for the case of Sj>Sk 
is given by, 



P(sj\w,Sk,T^_i,@)-- 



\-W[k,j 



= exp 



2Ni j 2Nj- 



E 



(10) 



1 — exp — - 



2M- 



The probability of re-coalescence for the case oi Sj = Sk is simply, 

(11) 



P{Sj = Sk\w,Sk,T1_l^&)= W[ k, k + 



I — exp 



^r'^k,.4 

2Nk 



Finally, the requirement for re-coalescence by the maximum time, 
Sk, is enforced by exphcitiy normalizing the distribution: 



Once the coalescence time point Sj is chosen, a lineage x is 

uniformly chosen from the active lineages in at that time 

point, similar to the process for recombination events. Thus, 

P{x,Sj\w,Sk,Ti-i,®)= -^^P{sj\w,Sk,Ti-\,@), and equation 6 can 

J 

be rewritten as, 
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J'(77|J??,77_,,0) = 

r 1 if Jfl = 0,T^ = T?.i (13) 

J -ra^fel""'*'^-!'®) if = (.x,sj)s.t.TY = SPR(IJ_i, w, St, x,sj), st<sj<SK 



0 



otherwise, 



where P(Sj\w, Sk, Tf_^, 0) is given by equations 10-12. 

Initial local tree. The DSMC begins by generating an initial 
local tree, T", using a discretized version of the coalescent process. 
This process can be decomposed into two steps: (1) the generation 
of a sequence of branch counts, C = (Co,Ci, . . . ,Ck) for time 
points So,Si, . . . ,Sk, and (2) sampling of a topology T" consistent 
with these branch counts. The probability of an observed initial 
tree T" can therefore be calculated as, 



P(T^\@) = P(T"i,C\®) = PiC\N) P{T"i\C), 



(14) 



where A' is a vector of effective population sizes, 
N = {No, . . . ,Nk)- The branch count for time 0 is constrained to 
be equal to the number of samples, Co = w, and the branch count 
for time K is required to be one, Ck = 1 (see below). 

Since the coalescent process is Markovian in time, the 
distribution for the vector of branch counts can be factored by 
time intervals. 



P{C\N) = PiCo) nP(Q|C,_i,M_i,Ar,_i), (15) 

with degenerate first and last terms, P(Co) = /[Co = «] and 
P{Ck\Ck-uNk-{)=I[Ck = V\. 

The conditional distributions of the form P(C/|C/_i, 
Ai/_i,iV/_i), for 1 < /<.Sr, have been derived previously as [100], 



P{Ci = b\Ci- 1 = a,A$,_ 1 = t,Ni-i) 

^ ^ f -k(k-l) \ (2k-l)(-lf-' 'y' (h+y){a-y) 
^ ""^y 47V;_i J b\(k-by.(k + b-l)y=o a+y 



(16) 



Hidden Markov Model 

Hidden Markov model for full threading problem. As 

noted in the Results section, the complete data likelihood function 
under the DSMC is given by equation 2. If the full ARG 
G" = {T",R") is regarded as a latent variable, this equation defines 
a hidden Markov model with a state space given by all possible 
pairs {T",R"), transition probabilities given by expressions of the 
form P(Rl\T';_^,p) P{T'^\R';,T1_^,N) and emission probabilities 
given by P{D"\T" ,n) (see Figure 3A). The transition probabilities 
can be computed using equations 5 and 13, and the emission 
probabilities can be computed using Felsenstein's pruning 
algorithm. This model can be viewed as an instance of the 
"phylo-HMMs" that have been widely used in comparative 
genomics [101]. As discussed in the Results section, however, 
unless the number of sequences n is very small, the state space of 
this HMM will be too large to allow it to be used directly for 
inference. 

Instead, we constrain the inference problem by fixing the ARG 
for the first n—l sequences, G"^', and sampling from the 
conditional distribution P(G"\G"~\iy',@). Using the notation 
G" = (T",ir) and G"-'^ =iT"-\R"-^), we defme T" = 



{T" ' , F), where Y = (yi, . . . ,yi} is a vector of coalescence points 
such that yi = {xi,ti) indicates a coalescence of the wth serjuence to 
branch x, and time point f, of local tree T"^^, and 
/f" = (/f"^', Z), where Z = (z2, ■ ■ ■ ,zi) is a vector of recombina- 
tion points such that z, = (Wi,M,) indicates a recombination at 
branch w,- and time point M,- of local tree T"sl between positions 
i—\ and /. (Note that Zi is undefined.) Thus, we can sample from 
the desired conditional distribution P{G"\G"^^ ,D",&) by sam- 
pling from P{Y,Z\T"-\R"-\iy',&). We refer to a sample 
(Y, Z) from this distribution as a threading of the wth sequence 
through the ARG (see Figure 3B). For now, we will consider a 
complete threading (Y,Z), but in later s(;cti<)iis we will describe 
our two-step process for sampling, first, the coalescent threading 
Y, and second, the recombination threading Z given Y. 

Note that the restriction to one recombination event per 
position implies that z, = 0 wherever -R"^'#0, and that 
T"Z\ =Tf^^ wherever z, #0. This restriction is not stricdy 
required but it simplifies the description of new recombination 
events z,-, and in the setting of interest here it comes with little cost 
(see Discussion). 

It turns out to be more convenient to work with the joint 
distribution P{T"-\ Y , R"-\ Z, ir\&) (the complete data 
likelihood) than with the conditional distribution 
P{Y,Z\T"-'^,R"-'^,ir,&). However, to emphasize that die 
variables T" ' and R" ' are held fixed ("clamped") at prc'- 
specified values throughout the threading operation, we denote 
them as T"~^ and R"~^, and refer to the distribution of interest as 
P(T"'\ Y,g''\z,D"\Q). (Notice that the data IT are also 
clamped, as usual for HMMs.) When T"-\ R"-\ and D" are 
clamped, 



p( r" \y,r""\z.r"\&) x p(y,z\t'' 



-\R"-\^,e). 



(17) 



Thus, samples of (F,.Z) drawn in proportion to the unnormalized 

density P(T" Y, R" Z, I)"\Q) will be valid samples from 
the desired conditional distribution. In generad, any clamped joint 
density function, P(A, B), can be viewed as an unnormalized 
version of a corresponding conditional density function, P(A\B), 
but sometimes the joint density is more convenient to manipulate. 

We can now write the density function for the (unnormalized) 
sampling distribution for a threading {Y,Z) as, 



P{T" \ Y, R" \ Z, D"\&) = 
P(f",-\ymPiD,\Tr\yi,ii) 

(18) 

np(^«-', z,!?",-/, p) P(fr\ yi\Rr\ T^-l,yi-u N) 

xP(bi\f^-\yuH), 

\\hcrc all t(;rms arc- computable using previously described 
expressions, as for equation 2. 

Notice that this threading HMM has the same conditional 
independence structure as the HMM for the full DSMC (equation 
2, Figure 3), but its state space is now defined by sets of possible 
(y,, z,) pairs rather than the set of possible (7?,J?") pairs, making it 
far more tractable for inference. 

Reduced model for coalescent threading. The state space 
can be reduced further by proceeding in two steps. First, we 
sample a coalescent threading Y from the marginal distribution 

P(T"~\Y,Rf'\D"\@)ocPiY\¥'^\R"~\D",®). Then we sam- 
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pie a recombination threading, Z, from the conditional distribution 

P{Z\Y,T''\lt'\®). Notice that the data need not be 
considered when sampling the recombination threading, because 
Z is conditionally independent of O" given Y, J""', and 
R"-\ 

The marginal distribution P(T" Y,it ',^"1©) can be 
computed efficiently by changing the order of products and sums 
in the usual way for HMMs: 



P{f \y,iC ',O"|0) 

= ^P(r""',F,ff""',Z,O"|0) 

z 

= P{T",-\y,\N)P(b, I T",-\yuli) 



(19) 



n 

i=2 



xP{b,\Tf-\yi,n) 

= P{ri-\y,\N)P{bi I T",-\yul£) 



nP(R1-\fr\yi \T^I^,yi-uP,N)P(bi 1 Tr\yi,ix). 
1=1 



This equation defines an HMM with a state space given by the 
possible values of only, the size of which is bounded by nK, 
where n is the number of sequences and K is the number of time 
intervals (see Figure 3C). 

While this model has the conditional independence structure of 
a standard HMM, the state space is heterogeneous along the 
sequence, because the set of possible coalescent points at each 
position / depends on the local tree, T"^'. (The full threading 
HMM described above also has this property.) If we denote the 
state space at position i as 5,-, the transition probabilities between 
states in position i—\ and states in position / are defined by a 
|>S,_i| X |iSi| transition matrix ={a'f^} where / and m index 
the states of |>S,_i| and |>S,|, respectively, and can be 

computed as, 



P(ff - ' ,yi = m\R1-' ,z,-,r«ri' ,yi- 1 = l,N) 



(20) 



using equations 5 and 13. The emission probability for align- 
ment column D" in state / in 5, is denoted h'^(D") = 
P(iyi\TJ~^ ,yi = m,n) and can be computed using Felsenstein's 
pruning algorithm, as in all cases above. The initial state 
probabilities for the HMM are given by Tim = PiJ""^^ ,yi =m\N) 
for l<m<|iSi| and can be computed using equations 14-16. 

Notice that, unlike with a standard, locally normalized HMM, it 
is not true in this model that Ylm^'l m~^- Furthermore, for two 
positions i and j, it is not true in general that ^'i m ~ '^i „,> 
because of differences across positions in the local trees T"^' and 
recombination points Similarly, it is not true that 

X^m = 1 • Thus, this model is not only globally unnormalized, 
but it also has a heterogeneous local normalization structure across 
positions. Importantly, this heterogeneity stems directiy from 



differences in the 7? ' and i?" ' and is inherent in the threading 

problem — that is, it is not possible to express the desired 

conditional distribution, P{Y\T ,R ,D ,©), directiy in terms 
of a locally normalized hidden Markov model (one in which all 
transition probabilities and all emission probabilities sum to one at 
each position in the sequence). For this reason, we find it most 
convenient to work with the unnormalized clamped joint 
distribution. 

Stochastic traceback. Despite the unusual features of the 

HMM described in the previous section, it still permits the use of 
standard dynamic programming cilgorithms to integrate over all 
coalescent threadings Y (the forward or backward algorithms), 
obtain a most likely threading Y (the Viterbi algorithm), compute 
marginal post(;ri()r distributions for each yi (forward-backward 
algorithm), and sample threadings in proportion to their condi- 
tional probability [102,103]. These algorithms depend only on the 
linear conditional independence structure of the model (and, 
equivalentiy, on its factorization into local transition and emission 
probabilities) and on the use of nonnegative potential functions, 
both properties that are maintained in this model. 

We are primarily interested in a dynamic programming 
algorithm for sampling from the posterior distribution over 
HMM paths that is sometimes referred to as the stochastic taceback 
algorithm [103-105]. In our case, each application of this 
algorithm is guaranteed to sample a coalescent threading Y in 

proportion to the density P(r" ',y,^" ',5"|0), and equiva- 
lentiy, in proportion to the desired conditional distribution. 

The stochastic traceback algorithm consists of a deterministic 
forward pass and a stochastic backward pass. The forward pass is 
identical to the forward algorithm. In our notation, the algorithm 
recursively fills out a matrix f^ = {fifn\> l<i<L, 
1 < w < max,(|<S, I). Eachyj n, represents the probability of a prefix 
of the data joint with a constraint on the state path at position 

Here, fi^m= P(f".i ^ ,D\j,yi = m\&), where the notation 
indicates the subsequence (X,-, . . . ,Xj). After an initialization of 
f\,m = T^mb]„(iyi), for l<w<|>Si|, the algorithm proceeds itera- 
tively for i from 2 to Z- and sets each value fi^ (for 1 < W< |>S,|) 
equal to. 



i-1 



(21) 



Note that the heterogeneity of the state space along the sequence 
implies that portions of the matrix are left undefined. 

In the backward pass, the algorithm samples a sequence Y one 
element at a time, starting with yi^ and working backward to yi. 
First, yL = l is simply sampled in proportion to /lj. Then, for i 
from L — 1 down to 1, each is sampled conditional on in 
proportion to. 



qi(yi = l I yi+i =m)a:fij 



(22) 



The limiting step of the algorithm is the forward pass, which in 
general requires 0{C^L) time, where C is the size of the state 
space, C= max, (|S,|). However, in our case the structure of the 
Ai matrices can be exploited to reduce the running time to 
0{nK^L) (see Text SI). 

It can be shown by induction on suffixes of Y that this 
procedure wUl correctly sample from the target distribution, 
P{Y\T"'\R"'\b'',@). Briefly, in die base case, the suffix 
yL = l is by construction sampled from the density 
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fLI = P(T" ^ ,D" ,y£ = l\&), which is proportional to the 

desired conditional distribution, P(yL = l\T" ',D",0). For 

the inductive case, assume Yi+i-x has been sampled from 

P(Yi+i:L\T" ^ ,R" ',^",0). The procedure of sampling from 
Qi given ji+i is equivalent to sampling from, 

qi(yi = l I yi+i =m)ccfij a'l^ 

= P{f\^\B!\r\bl^,y, = l\0)P 

{R1-l,f1-,\y,^, =m I Tl-\y, = Up,N) (23) 

= P{r^,Xx:^,J]l\,^, = l^,^x = m I 0) 

where the last step is possible because >',- is conditionally 
independent of I',+2:Z,, ^"^^-.L^ ^l+i.L' D"^i.^ given yt+i. 
Thus, the algorithm will correctly sample from 
PiYi,L\T"'\R"'\D",®) for all i such that 1 <i<L. 

Sampling a recombination threading. The final step in the 
threading operation is to sample a recombination threading Z 
conditional on a coalescent threading Y and the clamped 
parameters. This step is greatly simplified by the fact that the 
individual z,- values are conditionally independent of one another 
given the yj variables and the clamped T"~' and variables 
(see Figure 3B). Consequently, each z,- can be sampled separately 
from the distribution, 

pcz,- 1 «r',f«-',j>,,f--i',j),_i,0)KP(^ri,z,- 1 
^(77-1,^,. I «ri,z,-,f--/,j),_„Ao, 

where the >>,■ variables are now clamped along with the T"^' and 
variables. Notice that the distribution on the RHS is the 
same one considered in equations 19 & 20. The normalizing 
constant for this distribution, for clamped values yi-\=l and 
yt = m, is given by the transition probabiUty a'f^ . 

Notice that this distribution is implicitly degenerate in the case 
in which .R"^'#0, owing to the limitation of at most one 
recombination event per position. In particular, if .R"^' # 0, then 
P(^r',z,|f«rii,j,_i,p) = /[z, = 0], hence P{zi\R!}-\ff-\ 
yi,T"Zi ,yi-\,&) = I\zi = 0\. At the same time, notice that, if 
R"^^=0, a new recombination is still possible (z,-#0) even if 
T^Z\ = T"^ ' and yt-i =yi, because a branch could be broken by 
a recombination event but then re-coalesce at precisely its original 
position in the local tree. 

When R"^^ =0, the efficiency of sampling from this distribu- 
tion can be improved by noting that most possible z, values stiU 
have zero probability. Let Z represent the set of z,- values having 
nonzero probability for given values of yt-i, y,-, and v, where v 
denotes the branch being threaded. There are two cases to 
consider, a main case and a special case. We wiU denote the 
corresponding subsets of z; values Zi and Z2, with Z = Z\\JZ2. 
Recall that z, = (vf,-,M,) and yi = {xi,ti), where Xi and vc,- are 
branches in T"Z\ and Tf"'' respectively, and m,- and ti are time 
points from the set P = {.5o, . . . ,.?a:}. In the main case, the 
recombination occurs on the new branch v. Here, the recombi- 
nation time Ui must be at least as recent as both the old and new 



re-coalescence times, and /,. Thus, Z\=\{y,Ui)\ 

«ie7',M,<min(r,_i,r,)}. Notice that \Z\\<K-\-\. 

The special case occurs when the recombination occurs not on 
the new branch, v, but instead on the branch to which v re- 

coalesces at position z — 1. A recombination on branch X/_i, below 
the point at which v joins it, followed by a re-coalescence of x,_i 
to V (meaning that X,- = x,-i) wiU produce a signature exactiy like 
the symmetric case of a recombination on v followed by a re- 
coalescence to x,_i (Supplementary Figure S21), so this scenario 
must also be considered. This case can only occur when x,_i =X; 
and in the interval of time between the start of branch X/ and 
min(?,_i,;,). Recombinations on other branches need not be 
considered, because the existence of such a recombination would 
imply that J?,- # 0, contrary to our assumption. Hence, 

Z -[ I - - '^^^ Xi-i= Xi 

\ 0 otherwise, 

where is the time point of the child node of branch x,-. As with 
2i, \Z2\<K+\. 

By enumerating the elements of Z, it is possible to sample each 
Zi in 0{K) time. The same approach can be used to enable 
calculation of the values (equation 20) in 0{K) time. 

Data Preparation 

Simulated data. Except where noted otherwise, simulations 

were performed under the full coalescent-with-recombination 
model [2 1] . After generation of local trees, sequence alignments 
were generated using a finite-sites Jukes-Cantor model [53]. All 
simulations were performed using custom computer programs. 
Our standard simulation scheme involved the generation of of 
twenty 1-Mb sequences, assuming an effective population size of 
7V'= 10,000, a mutation rate of 1.8x10"^ mutations/site/ 
generation, and mutation-to-recombination rate ratios of 
nl pe{\,l,A,6} (i.e., recombination rates of pe{l. 8,0. 9,0.45,0. 3} 
X 10~* events/site/generation). One hundred replicate data sets 
were generated for each choice of uj p. Alternative parameter 
values were used in certain cases, as noted in the text and figure 
captions. 

Real data. Information about human polymorphisms came 
from the "69 Genomes" data set from Complete Genomics (CG) 
(http:/ / www.completegenomics.com/ public-data/ 69-Genomes). 
For each individual considered, we recorded the diploid 
genotype call reported for each position in the hgl9 (Genome 
Reference Consortium Human Build 37) reference genome using 
CG's 'masterVar' files. We considered both "SNPs" and "length- 
preserving substitutions" in the masterVar file, and also noted 
positions where CG could not confidendy assign a genotype. All 
other positions were assumed to be homozygous for the allele 
reported in the reference genome. 

Borrowing from our previous work on demography inference 
[92], we appKed several filters to these data to reduce the impact of 
technical errors from ahgnment, sequencing, genotype inference, 
and genome assembly. These filters include simple repeats, recent 
segmental duplications, and transposable elements. We phased the 
data using SHAPEIT v2 [59] , guided by the pedigree information 
describing the relationships among the 69 individuals. After 
phasing, we removed the child in each trio, as well as all but the 
four grandparents in the 17-member CEU pedigree, leaving 54 
unrelated individuals in our data set. From this set, we further 
filtered all CpG sites, sites with more than two observed alleles, 
and sites at which CG did not call a genotype in any of the 54 
individuals. AH genomic positions excluded by our filters were 
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treated as missing data by ARGweaver, meaning that the program 
integrated over all possible nucleotides at these positions (as in [92]). 

In order to account for region-specific variation in recombination 
and mutation rates, we used the HapMap phase II recombination 
map [106] and a mutation rate map estimated from alignments of 
several primate genomes, including chimpanzee (panTro2), orang- 
utan (ponAbe2), and rhesus Macaque (rheMac2) [107]. Mutation 
rates were scaled to have an average of 1.26 x 10~* mutations/ 
generation/ site and were averaged over 1 00 kb non-overlapping 
windows. This value was obtained by assuming a genome-wide 
average of 1.8 X 10^^ mutations/generation/site, and observing a 
30% reduction in nucleotide diversity when the CpG filter was 
applied. 

Calls of hypervariable and invariant regions were obtained from 
the CG FTP site (ftp://ftp2.completegenomics.c<)m). Copy number 
variant calls for each individual were obtained from a file named 
cnvDetaUsDiploidBeta, which was extracted from an ASM-VAR- 
files tar archive. 

Data Analysis 

To sample ARGs genome-wide, we split each sequence 
alignment into non-overlapping segments of 2 Mb, flanked on each 
side by 100 kb of overlapping sequence. We chose a core set of 12 
individuals (24 haplotypes) randomly such that each major 
population group was represented. We then used ARGweaver to 
sample ARGs for these genomes, assuming a population size of 
-A'^= 11,534, K=19 time steps, and a maximum time of 
ijf= 1,000,000 generations. Our prior estimate oi N was based 
on an empirical estimate of 4-Nfi x 7t = 5.8x10-'* from the CG 
sequence data, and an assumption of = 1 .26 x 10~* mutations per 
site per generaticm for non-CpG sites (see previous section). This 
initial step involved 500 sampling iterations, consisting of 100 initial 
iterations under an infinite sites assumption, and 400 iterations with 
the fuU finite sites model. The final sample from this initial step was 
used as a starting point for threading in the remaining genomes. 
Once these were threaded, we applied ARGweaver with infinite sites 
for 100 iterations, followed by 2400 iterations with the finite sites 
model. Samples were recorded every 1 0 iterations for the final 2000 
iterations, for a total of 200 samples. For our genome-wide analyses, 
we integrated the separate 2.2 Mb analyses by setting a switchpoint 
at the middle of each overlapping 100 kb segment, in order to 
minimize boundary effects at the analyzed sites. 

To compute the neutral CDFs in Figure SI 7, we used a set of 
putatively neutral regions obtained by removing all GENCODE 
(vl5) genes plus 1000 bp flank on either side of each exon, as well 
as all mammalian phastCons elements plus 100 bp of flanking 
sequence. From the remaining portion of the genome, we sampled 
1000 sets of 69 regions with the same distribution of lengths as the 
non-CpG regions identified by [77]. 

To estimate the allele age at each polymorphic site, we 
considered all local genealogies sampled at that position, 
discarding any sampled genecdogies that required more than one 
mutation to explain the observed data. In addition, we required 
that all of the retained genealogies implied the same derived allele, 
excluding positions that violated this condition from our analysis. 
For the remaining cases, we estimated the allele age for each 
sample as the average age of the branch on which the mutation 
leading to the derived allele was assumed to occur by parsimony, 
and averaged this value across samples. 

Supporting Information 

Figure SI ARGs simulated under Discretized Sequentially 
Markov Coalescent model are similar to those simulated under 



continueous models. ARGs were simulated using the coalescent- 
with-recombination (red). Sequentially Markov Coalescent (green), 
and Discretized Sequentially Markov Coalescent (blue). Three 
versions of the DSMC were considered: ones with with ^^ = 39 
(dark blue), K=19 (medium blue), and K = 9 (light blue) time 
intervals. In all cases, we assumed Sjs: = 200,000 generations. Our 
standard simulation parameters were used (see Methods) except 
that sequences were of length 100 kb (rather than 1 Mb) to save in 
computation. (A) Numbers of recombinations at four diflerent 
recombination rates corresponding to /(/p= 1,2,4,6 (in reverse 
order). To make the comparison fair, recombinations between 
nonancestral secjuences (which are disallowed by the SMC/ 
DSMC) are excluded in the case of the coalescent-with- 
recombination. However, "diamond" or "bubble" recombinations 
(ones that are immediately reversed by coalescence events, going 
backwards in time) were included, so any distortion from 
excluding these events in the SMC/DSMC is reflected in the 
figure. (B) Numbers of segregating sites at three different effective 
population sizes with fij p=\. 
(PDF) 

Figure S2 Illustration of "leaf trace-." An example leaf trace 
(highlighted in gray) is shown for a hypothetical 10-kb genomic 
segment and six haploid sequences. The ARG for these sequences 
contains two local trees (shown to left and right) separated by a 
single recombination event (red circle and arrow). In the leaf trace, 
each sequence is represented by a line, and these lines are ordered 
and spaced according to the local tree at each position. Spacing 
between adjacent lines is proportional to time to most recent 
common ancestry of associated sequences. (Notice, however, that 
it is not possible to impose a similar interpretation on non-adjacent 
lines in the diagram.) Nonrecombining genomic intervals are 
reflected by blocks of parallel lines. Recombinations lead to 
changes in spacing and/ or order and produce vertical lines in the 
plot. Notice that aspects of the leaf ordering are arbitrary, because 
the two children between each ancestral node can be exchanged 
without altering the meaning of the diagram. In addition, this 
visualization device applies to a single ARG and does not easily 
generalize to distributions of possible ARGs. For our genome 
browser tracks, we use the single most likely ARG sampled by 
ARGweaver as the basis for the plots. Finally, note that the lines in 
the plot can be colored in various ways. In our current tracks, they 
are colored according to the population origin of each haploid 
sequence. 
(PDF) 

Figure S3 Convergence of ARGweaver with simulated data. 
When the number of sequences exceeds 6-8, the Metropolis- 
Hastings algorithm and subtree threading operation are needed 
for ARGweaver to have acceptable convergence properties. This 
plot shows results for 20 1-Mb sequences, generated under our 
standard simulation parameters with n/ p = 2 (Methods). Here the 
measure of convergence is the difference between the number of 
inferred recombination events and the number of true recombi- 
nation events. Other measures show similar patterns. 
(PDF) 

Figure S4 Recovery of global features of simulated data for 

various values of /(/p. This figure is the same as Figure 4A, except 
that it shows results for four different values of the mutation-to- 
recombination rate ratio, ranging from fi/ p= \ (bottom row) to 
^1 p = (> (top row). The second row from the bottom (with /i/p = 2) 
is identical to Figure 4A. Notice that high values oi p/ p lead to 
reduced variance in all estimates, owing to larger numbers of 
mutations per local genealogy, but that the estimates remain 
reasonably accurate in all cases. However, there does appear to be 
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a slight tendency to under-estimate the number of recombinations, 
particularly at low values of ^ij p, probably due to approximations 
inherent in the DSMC (see text). Note that these are generated by 
the full coalescent with recombination, not the DSMC. 
(PDF) 

Figure S5 Recovery of TMRCA along simulated sequences for 
various values of /i/p. This figure is the same as Figure 4B except 
that it shows results for four diflFrent values of the mutation-to- 
recombination rate ratio, ranging from fi/ p= \ (bottom panel) to 
nl p = (> (top panel). Each panel represents one randomly selected 
simulated data set. Pearson's correlation coefficients (r) for true vs. 
estimated TMRCAs across all local trees are shown in the top 
right corner of each panel. As expected, the quality of the 
estimates generally improves with p/ p, but this example suggests 
there is limited improvement above p/p=4. 
(PNG) 

Figure S6 Recovery of recombination rates from simulated data. 
We simulated an alignment of 100 sequences with N= 10,000 and 
p = 2.5x 10~*, allowing for variable recombination rates based on 
estimates along the human genome. Despite the assumption in the 
prior of a constant recombination rate of p=l. 16x10^*, the 
posterior mean estimate of the average number of recombinations 
in a 1 kb sliding window (red line) correlates well with the true 
recombination rates used during simulation (black line). Notice that 
recombination hotspots are clearly identifiable by peaks in the 
inferred rates but the magnitudes of these peaks are dampened by 
the use of a uniform prior. Only recombinations that produced 
changes in tree topology (the class that is detectable by our methods) 
were considered for the plot of the true recombination rate. 
(PDF) 

Figure S7 Estimating ages of derived alleles in simulated data. 
(A,C,E,G) Inferred allele age correlates well with true allele age 
according to both Pearson's (r) and Spearman's rank (r^) 
correlation coefficients. Correlation is strongest for high muta- 
tion/ recombination rate ratios. Ages were estimated by calculating 
the midpoint of the branch on which the mutation was inferred to 
occur, under an infinite sites model, and averaging across sample 
from the posterior distribution. Points are colored on a spectrum 
from blue to green in proportion to derived allele frequencies. 
(B,D,F,H) Allele frequency has significandy lower correlation with 
true allele age, implying that the ARG will enable much better 
estimates of allele age than allele frequencies alone. Ages are 
measured in generations before the present. Our standard 
simulated data sets were used (Methods). 
(PDF) 

Figure S8 Recovery of local tree topologies. Sequences were 
simulated under the coalescent-with-recombination using our 

standard parameters (Methods), ARGs were inferred using 
ARGweaver, then 100 equally spaced local trees were extracted 
from the sampled ARGs. The topologies of these trees were 
compared with the true trees generated during simulation at 
corresponding positions in the alignment. We compared ARGwea- 
ver with the heuristic programs Margarita [34] and treesim using two 
measures: (A) branch correctness (one minus the normalized 
Robinson-Foulds (RF) distance [108]) and (B) Maximum Agree- 
ment Subtree (MAST) percentages (the size of the largest leaf-set 
such that induced subtrees are topologically equivalent, expressed 
as a percentage of the total number of leaves), across a range of 
mutation to recombination rate ratios (p/p). In both (A) and (B), 
error bars reflect one standard error assuming independence of 
100 local trees xlO simulation replicates. 
(PDF) 



Figure S9 Local tree branch posterior probabilities inferred by 
ARGweaver accurately reflect their probability of correctness. The 
branch posterior probabilities found by ARGweaver (red) and treesim 
(green) more accurately reflect the probability of the branch being 
correct than the frequency at which Margarita (blue) infers a 
branch. For each method, branches were binned by their posterior 
probability (windows of 5%) and compared against their frequency 
of branch correctness. Shaded regions represent the 95% binomial 
confidence interval. This plot is based on our standard simulated 
data set with p/p = 6. Posterior probabilities for ARGweaver are 
based on 1000 samples from the Markov chain, and the 
probabilities for Margarita and treesim reflect 100 independent 
samples. 
(PDF) 

Figure SIO Illustration of relative TMRCA halflife (RTH). 
Expected genealogies under (A) neutral drift, (B) background 
selection, and (C) a partial selective sweep. In each panel, the 
arrows to the left indicate the complete TMRCA (7") and the "half 
TMRCA" [H), that is, the minimum time required for half of all 
lineages to find a single most recent common ancestor. The 
relative TMRCA halflife (RTH) is defined by the ratio H/T. 
Because background selection (B) should primarily reduce the 
overall rate of coalescence, in a manner more or less homogeneous 
with respect to time, it is expected to have fittle effect on the RTH. 
Partial sweeps (C), however, will tend to produce a "burst" of 
coalescent e\'ents following a causal mutation (red circle), leading 
to reduced values of H. Nevertheless, because some Uneages 
escape the sweep, the full TMRCA T is likely to remain similar to 
its value under neutrality. As a result, the RTH wiU be reduced. 
(PDF) 

Figure Sll Measures of genetic variation near protein-coding 

genes and partial selective sweeps for African populations. This 
figures is identical to Figure 5 except that it shows results for 17 
African individuals or 34 haploid genomic's (from the YRI, MKK, 
and LWK populations). Panel (A) is based on the same 17,845 
protein-coding genes as in Figure 5 A. Panel (B) is based on 271 
100-kb regions predicted to have undergone partial selective 
sweeps in the YRI population based on the iHS statistic [72]. 
(PDF) 

Figure SI 2 Time to most recent common ancestry' (TMRCA) in 
the human leukocyte antigen (HLA) region. Genome browser 
track displaying the sitewise time to most recent common ancestry 
(TMRCA) estimated by ARGweaver based on the Complete 
Genomics individual human genome sequence data (track is 
available at http://genome-mirror.bscb.cornell.edu, assembly 
hgl9). The human leukocyte antigen (HLA) region on human 
chromosome 6 contains many genomic intervals with extremely 
elevated expected TMRCAs, including four of the top 20 10-kb 
regions in the genome (highlighted here in gold; see descriptions in 
Table 2). The red line indicates the posterior mean of the 
TMRCA (estimated by averaging over the sampled local trees) and 
the blue lines above and below indicate a Bayesian 95% credible 
interval. 
(PDF) 

Figure S13 Mutation-rate normalized polymorphism rates in 
the 1000 Genomes Phase 1 data are elevated in the top twenty 
10 kb regions by TMRCA. Shown are cumulative distribution 
fuctions for normalized polymorphism rates (computed as for 
Table 2) in all 10 kb windows across the human genome (black), 
the top twenty regions shown in Table 2 (red), and the fifteen 
regions not identified as possible CNVs in Table 2 (blue). 
(PDF) 
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Figure S14 ARGweaver tracks near KCNE4. Shown is a ~10-kb 
peak in tlie estimated TMRCA about 20 kb downstream of the 
KCNE4 gene (shown in blue), which encodes a potassium voltage- 
gated channel strongly expressed in the embryo and adult uterus. 
The peak overlaps two ChlP-seq-supported transcription factor 
binding sites analyzed by Arbiza et al. [109] ("INSIGHT 
Regulatory Selection" track). The four tracks below the TMRCA 
plot show that the region in question displays elevated rates of 
both low-frequency (<10% derived allele frequency; shown in 
blue) and high-frequency (^lO'Kj; shown in red) polymorphisms 
in the Complete Genomics data set, despite that divergence- 
based estimates of the mutation rate are at or below the genome- 
wide average (average values are indicated by horizontal black 
lines). ARGweaver explains these observations by inferring one of 
the deepest average TMRCAs in the human genome (^^5 in 
Table 2). Additional tracks show no indication of copy number 
variation or recent duplications in this region. The leaf trace 
indicates that the signal for a deep TMRCA is driven by 
individuals from African populations (shown in green; the 
European and East Asian populations are shown in blue and 
red, respectively), suggesting that this region may contain ancient 
haplotypes specilic to Africa. 
(PDF) 

Figure S15 ARGweaver tracks near BCAR3. Shown is a large 
region of elevated TMRCA in an intron of the BCAR3 gene, which 
is involved in the development of anti-estrogen resistance in breast 
cancer. One 1 0-kb segment of this region has an average expected 
TMRCA of 377,017 generations, or approximately 9.4 My (#9 in 
Table 2). As in the previous example, this region shows elevated 
polymorphism rates but average or below-average mutation rates 
and overlaps ChlP-seq-supported transcription factor binding sites 
(INSIGHT track) [109]. Again, the regions of extreme TMRCA 
do not seem to be explained by copy number variation or recent 
duplications. In this case, however, the leaf trace demonstrates that 
the ancient haplotypes are distributed across all three major 
population groups (African = green, European = blue, East 
Asian = red). 
(PDF) 

Figure S16 ARGweaver tracks near TULP4. Another large region 
of elevated TMRCA upstream of the TULP4 gene, which is 
thought to be involved in ubiquitination and proteosomal 

degradation and has a possible association with cleft lip. One 
10-kb segment has an average expected TMRCA of 345,382 
generations (8.6 My; #16 in Table 2). As in the previous two 
examples, this region has elevated polymorphism rates but not 
mutation rates, overlaps ChlP-seq-supported transcription factor 
binding sites (INSIGHT track), and does not seem to be an artifact 
of copy number variation or recent duplications. 
(PDF) 

Figure S17 Distribution of TMRCAs in regions predicted to be 
under balancing selection. Cumulative distribution functions 
(CDFs) are shown for the 125 regions identified by Leffler et al. 
[77] based on segregating haplotypes shared between humans and 
chimpanzees (black circles), the subset of 69 loci containing no 
shared polymorphisms in CpG dinucleotides (red circles) and a 
collection of 69 putatively neutral regions having the same length 
distribution. Neutral regions consisted of noncoding regions from 
which known genes, binding sites, and conserved elements had 
been removed (see [109]). Notice the pronounced shift toward 
larger TMRCAs in the regions predicted to be under balancing 
selection, and a slighdy more pronounced shift for the subset not 
containing CpGs (which are less likely to have undergone parallel 



mutations on both lineages). TMRCAs are measured in 

generations, as in all other figures and tables. 

(PDF) 

Figure S18 ARGweaver tracks near locus containing segregating 
haplotypes shared in humans and chimpanzees. Elevated 
TMRCA corresponding to a region identified by Leffler et al. 

[77] between the FREM3 and GYPE genes (#11 in Table 3; see 
black square in track at bottom). The shared polymorphisms in 
this region are in strong linkage disequilibrium with eQTLs for 
GTPE, a paralog of GTPA, which may be under balancing 
selection. The leaf trace indicates that the ancient haplotypes are 
shared across major human population groups (African = green, 
European = blue. East Asian = red). 
(PDF) 

Figure S19 Reduction in mean allele age as a function of 
annotation class and derived allele frequency. This figure shows 
the same information as Figure 6B, but instead of plotting absolute 
values of the estimated allele ages, it plots the estimated reduction in 
allele age relative to neutrality, which is defined as the differences 
between the estimated age for each annotation type and the 
estimate for the corresponding neutral class (in generations). This 
representation shows clearly that the reduction in allele age 
increases with allele frequency much more rapidly for annotation 
classes under strong selc-ction than for those under weak selection. 
The contrast between the nearly neutral classes (4d, PPhiBenign, 
CV:NonPath) and the strongly selected classes (PPh:ProbDam, 
CViPath) is particularly striking. This difference can be under- 
stood as follows. Reductions in allele age at nearly neutral sites will 
primarily be a consequence of selection at linked sites, which, to a 
first approximation, wiU decrease the local effective population 
size. This will have the effect of approximately re-scaling allele 
ages by a constant factor across all ages, making the reduction in 
age roughly proportional to the absolute age. Mutations under 
stronger direct selection, by contrast, will spend disproportionally 
less time at higher frequencies, making their reductions in age at 
high frequencies disproportionally larger than those for nearly 
neutral mutations (see [79]). This effect will occur even in the 

absence of dominance {h = -), but it could be exascerbated by 

dominance, which will tend to make low-frequency alleles invisible 
to direct selection. In any case, this plot shows that selection from 
linked sites can produce comparable, or even larger, reductions in 
age than direct selection at low allele frequencies, but at high 
frequencies, direct selection tends to dominate in age reduction. 
(PDF) 

Figure S20 Human population phylogenies inferred from 
sampled ancestral recombination graphs. Phylogenetic networks 
for the eleven populations represented in the Complete Genomics 
data set were reconstructed using the PhyloNet program [90,91]. 
As input to PhyloNet, we used 2,304 local trees extracted from the 
ARG at approximately 1 Mb intervals, with one randomly 
sampled chromosome per population (see Text SI). (A) Population 
phylogeny inferred in the absence of hybridization/admixture, 
showing the expected primary relationships among populations. 
(B) Population networks inferred when between one and five 
hybridization nodes are allowed. Populations inferred to be 
admixed are indicated by gray lines and the inferred hybridization 
nodes are shown as gray circles. Numbers indicate the order in 
which these nodes appear. For example, when one hybridization 
node is allowed, the MKK population is inferred to be admixed, 
and when two are allowed, the MXL population is also inferred to 
be admixed. The inferred network is consistent with other recent 
studies in many respects, but PhyloNet is unable to reconstruct the 
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precise topolog)' of the complex subnetwork consisting of the GIH, 
MXL, PUR, CEU, and TSI populations (see Text SI). Population 
names follow the convention used by the HapMap 3 and 1000 
Genomes projects: CHB = Han Chinese in Beijing, China; 
JPT ^Japanese in Tokyo, Japan; GIH = Gujarati Indians in 
Houston, Texas; MXL = Mexican ancestr)^- in Los Angeles, 
California; PUR = Puerto Ricans in Puerto Rico; CEU = Utah 
residents with Northern and Western European ancestry from the 
Centre d'Etude de Polymorphisme Humain (CEPH) collection; 
TSI = Toscani in Italy; MKK = Maasai in Kinyawa, Kenya; 
LWK = Luhya in Webuye, Kenya; ASW = Afric:an ancestry in 
Southwest USA; YRI - Yoruba in Ibadan, Nigeria. 
(PDF) 

Figure S21 Cases for new recombination Z/ given re-coalescence 
point yi. (A) In the main case, the recombination Z/ (blue point) 
occurs on the branch that is being threaded into the ARG (v; 
shown in red). After a recombination on this branch, a re- 
coalescence can occur at any point yt (green points) in the local 
tree T"~^ such that yi is at least as old as Z/. Therefore, when 
enumerating the possible Z/ consistent with a given y/, one must 
consider all points on branch v at least as recent as yi. This set is 
denoted Zi in the text. (B) There is an additional special case to 
consider when branch v coalesces to the same branches of T""^ at 
positions /—I and /, that is, when In this case, it is 

possible that the recombination Z; (blue point) occurs not on the 
new branch v but on Xi (black branch) at a time point no older 
than the re-coalescence time yt (green points). A recombination of 
this kind will leave an identical signature to the symmetric case of a 
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